Torch Agent implements much of the boilerplate necessary for creating a neural dialogue agent, so you can focus on modeling. Its functionality is limited to maintaining dialogue history, transforming text into vectors of indices, and loading/saving models. You are required to implement your own logic in methods like train_step and eval_step.

Torch Ranker Agent and Torch Generator Agent have more specialized stub methods and build additional functionality on top of Torch Agent. Torch Ranker Agent assumes your model ranks responses from a set of candidates, and provides options around negative sampling, candidate sampling, and large-scale candidate prediction. Torch Generator Agent assumes your model generates utterances auto-regressively, and provides generic implementations of beam search.

Torch Agent

General utility code for building PyTorch-based agents in ParlAI.

Contains the following main utilities:

  • TorchAgent class which serves as a useful parent class for other model agents

  • Batch namedtuple which is the input type of the main abstract methods of the TorchAgent class

  • Output namedtuple which is the expected output type of the main abstract methods of the TorchAgent class

See below for documentation on each specific tool.

class parlai.core.torch_agent.Batch(text_vec=None, text_lengths=None, label_vec=None, label_lengths=None, labels=None, valid_indices=None, candidates=None, candidate_vecs=None, image=None, observations=None, **kwargs)

Bases: parlai.core.utils.AttrDict

Batch is a namedtuple containing data being sent to an agent.

This is the input type of the train_step and eval_step functions. Agents can override the batchify function to return an extended namedtuple with additional fields if they would like, though we recommend calling the parent function to set up these fields as a base.

  • text_vec – bsz x seqlen tensor containing the parsed text data.

  • text_lengths – list of length bsz containing the lengths of the text in same order as text_vec; necessary for pack_padded_sequence.

  • label_vec – bsz x seqlen tensor containing the parsed label (one per batch row).

  • label_lengths – list of length bsz containing the lengths of the labels in same order as label_vec.

  • labels – list of length bsz containing the selected label for each batch row (some datasets have multiple labels per input example).

  • valid_indices – list of length bsz containing the original indices of each example in the batch. We use these to map predictions back to their proper row, since e.g. we may sort examples by their length or some examples may be invalid.

  • candidates – list of lists of text. The outer list has size bsz; inner lists vary in size based on the number of candidates for each row in the batch.

  • candidate_vecs – list of lists of tensors. The outer list has size bsz; inner lists vary in size based on the number of candidates for each row in the batch.

  • image – list of image features in the format specified by the --image-mode arg.

  • observations – the original observations in the batched order

__init__(text_vec=None, text_lengths=None, label_vec=None, label_lengths=None, labels=None, valid_indices=None, candidates=None, candidate_vecs=None, image=None, observations=None, **kwargs)

Initialize AttrDict using input dict.
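As a rough illustration of how a Batch can be read and extended, the sketch below uses a stand-in AttrDict, that is, a dict whose keys are also readable as attributes. The AttrDict class, the make_batch helper, and the my_extra_field name are hypothetical simplifications written for this example, not the ParlAI implementation.

```python
class AttrDict(dict):
    """Hypothetical stand-in: a dict whose keys are also attributes."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.__dict__ = self


def make_batch(text_vec=None, text_lengths=None, labels=None,
               valid_indices=None, **kwargs):
    """Build a Batch-like object; extra fields are passed through kwargs."""
    return AttrDict(text_vec=text_vec, text_lengths=text_lengths,
                    labels=labels, valid_indices=valid_indices, **kwargs)


batch = make_batch(
    text_vec=[[4, 5, 6], [7, 8, 0]],  # bsz x seqlen, padded with 0
    text_lengths=[3, 2],
    labels=["hi there", "bye"],
    valid_indices=[0, 1],
    my_extra_field="anything",        # illustrative additional field
)
print(batch.text_lengths, batch.my_extra_field)
```

Fields that are not supplied default to None, so code consuming the batch can check for None before using an optional field.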

class parlai.core.torch_agent.Output(text=None, text_candidates=None, **kwargs)

Bases: parlai.core.utils.AttrDict

Output is an object containing agent predictions.

This is the expected return type of the train_step and eval_step functions, though agents can choose to return None if they do not want to answer.

  • text (List[str]) – list of strings of length bsz containing the predictions of the model

  • text_candidates (List[List[str]]) – list of lists of length bsz containing ranked predictions of the model. each sub-list is an ordered ranking of strings, of variable length.

__init__(text=None, text_candidates=None, **kwargs)

Initialize AttrDict using input dict.

class parlai.core.torch_agent.History(opt, field='text', vec_type='deque', maxlen=None, size=-1, p1_token='__p1__', p2_token='__p2__', dict_agent=None)

Bases: object

History handles tracking the dialogue history/state over the course of an episode.

History may also be used to track the history of any field.

  • field – field in the observation to track over the course of the episode (defaults to ‘text’)

  • vec_type – specify a ‘list’ or ‘deque’ to save the history in this object

  • maxlen – if vec_type is ‘deque’, this sets the maximum length of that object

  • p1_token – token indicating ‘person 1’; opt must have ‘person_tokens’ set to True for this to be added

  • p2_token – token indicating ‘person 2’; opt must have ‘person_tokens’ set to True for this to be added

  • dict_agent – DictionaryAgent object for tokenizing the history

__init__(opt, field='text', vec_type='deque', maxlen=None, size=-1, p1_token='__p1__', p2_token='__p2__', dict_agent=None)

Initialize self. See help(type(self)) for accurate signature.


Tokenize text with the given dictionary.


Clear the history.

update_history(obs, add_next=None)

Update the history with the given observation.


add_next – string to append to history prior to updating it with the observation


Returns the string version of the history.


Returns a vectorized version of the history.


Returns a list of history vecs.
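The behaviour described above (a bounded history that is updated once per observation and can be rendered as a string) can be sketched with a deque. SimpleHistory is a hypothetical, heavily simplified class; apart from update_history and its add_next argument, the method names and the newline delimiter are assumptions made for this example.

```python
from collections import deque


class SimpleHistory:
    """Hypothetical, simplified history tracker for one observation field."""

    def __init__(self, field="text", maxlen=None, delimiter="\n"):
        self.field = field
        self.delimiter = delimiter
        # A deque with maxlen silently drops the oldest turns when full.
        self.turns = deque(maxlen=maxlen)

    def update_history(self, obs, add_next=None):
        # add_next is appended before the new observation, for example
        # the agent's own previous reply.
        if add_next is not None:
            self.turns.append(add_next)
        if obs.get(self.field) is not None:
            self.turns.append(obs[self.field])

    def get_history_str(self):
        return self.delimiter.join(self.turns) if self.turns else None

    def reset(self):
        self.turns.clear()


history = SimpleHistory(maxlen=3)
history.update_history({"text": "hello"})
history.update_history({"text": "how are you?"}, add_next="hi!")
history.update_history({"text": "great"}, add_next="fine, you?")
print(history.get_history_str())
```

With maxlen=3, only the three most recent turns survive, which is one simple way to bound the context passed to a model.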

class parlai.core.torch_agent.TorchAgent(opt, shared=None)

Bases: parlai.core.agents.Agent

A provided base agent for any model that wants to use Torch.

Exists to make it easier to implement a new agent. Not necessary, but reduces duplicated code.

Many methods are intended to be either used as is when the default is acceptable, or to be overridden and called with super(), with the extra functionality added to the initial result. See the method comment for recommended behavior.

This agent serves as a common framework for all ParlAI models which want to use PyTorch.

classmethod optim_opts()

Fetch optimizer selection.

By default, collects everything in torch.optim, as well as importing: - qhm / qhmadam if installed from

Override this (and probably call super()) to add your own optimizers.

static dictionary_class()

Return the dictionary class that this agent expects to use.

Can be overridden if a more complex dictionary is required.

classmethod history_class()

Return the history class that this agent expects to use.

Can be overridden if a more complex history is required.

classmethod add_cmdline_args(argparser)

Add the default commandline args we expect most agents to want.

__init__(opt, shared=None)

Initialize agent.


Return the constructed dictionary, which will be set to self.dict.

If you need to add additional tokens to the dictionary, this is likely the right place to do it.

init_optim(params, optim_states=None, saved_optim_type=None)

Initialize optimizer with model parameters.

  • params – parameters from the model

  • optim_states – optional argument providing states of optimizer to load

  • saved_optim_type – type of optimizer being loaded, if changed will skip loading optimizer states

build_lr_scheduler(states=None, hard_reset=False)

Create the learning rate scheduler, and assign it to self.scheduler. This scheduler will be updated upon a call to receive_metrics.

May also create self.warmup_scheduler, if appropriate.

  • states (state_dict) – Possible state_dict provided by model checkpoint, for restoring LR state

  • hard_reset (bool) – If true, the LR scheduler should ignore the state dictionary.


Use the metrics to decide when to adjust LR schedule.

This uses the loss as the validation metric if present; otherwise, this function does nothing. Note that the model must be reporting loss for this to work.

Override this method to change the behavior.


Share fields from parent as well as useful objects in this class.

Subclasses will likely want to share their model as well.

vectorize(obs, history, add_start=True, add_end=True, text_truncate=None, label_truncate=None)

Make vectors out of observation fields and store in the observation.

In particular, the ‘text’ and ‘labels’/’eval_labels’ fields are processed and a new field is added to the observation with the suffix ‘_vec’.

If you want to use additional fields on your subclass, you can override this function, call super().vectorize(…) to process the text and labels, and then process the other fields in your subclass.

Additionally, if you want to override some of these default parameters, then we recommend using a pattern like:

def vectorize(self, *args, **kwargs):
    kwargs['add_start'] = False
    return super().vectorize(*args, **kwargs)

  • obs – Single observation from observe function.

  • add_start – default True, adds the start token to each label.

  • add_end – default True, adds the end token to each label.

  • text_truncate – default None, if set truncates text vectors to the specified length.

  • label_truncate – default None, if set truncates label vectors to the specified length.


the input observation, with ‘text_vec’, ‘label_vec’, and ‘cands_vec’ fields added.
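To make the vectorization step concrete, here is a minimal sketch that maps tokens to indices, adds start and end tokens to the label, and truncates. It is not the ParlAI implementation: the whitespace tokenizer, the tok2ind mapping, the index values, the 'labels_vec' key, and the choice to keep the rightmost text tokens when truncating are all assumptions made for illustration.

```python
def vectorize_sketch(obs, tok2ind, add_start=True, add_end=True,
                     text_truncate=None, label_truncate=None,
                     start_idx=1, end_idx=2, unk_idx=3):
    """Add 'text_vec' and 'labels_vec' to obs (simplified illustration)."""

    def to_ids(text):
        # Whitespace tokenization is an assumption made for brevity.
        return [tok2ind.get(tok, unk_idx) for tok in text.split()]

    if "text" in obs:
        vec = to_ids(obs["text"])
        if text_truncate is not None:
            # Assumption: keep the most recent (rightmost) tokens.
            vec = vec[len(vec) - min(len(vec), text_truncate):]
        obs["text_vec"] = vec

    if "labels" in obs:
        vec = to_ids(obs["labels"][0])
        if add_start:
            vec = [start_idx] + vec
        if add_end:
            vec = vec + [end_idx]
        if label_truncate is not None:
            vec = vec[:label_truncate]
        obs["labels_vec"] = vec
    return obs


tok2ind = {"hello": 4, "world": 5, "hi": 6}
obs = vectorize_sketch(
    {"text": "hello big world", "labels": ["hi world"]},
    tok2ind,
    text_truncate=2,
)
print(obs["text_vec"], obs["labels_vec"])
```

The out-of-vocabulary word "big" maps to the unknown index, and the label is wrapped in the start and end indices.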

batchify(obs_batch, sort=False)

Create a batch of valid observations from an unchecked batch.

A valid observation is one that passes the lambda provided to the function, which defaults to checking if the preprocessed ‘text_vec’ field is present which would have been set by this agent’s ‘vectorize’ function.

Returns a namedtuple Batch. See original definition above for in-depth explanation of each field.

If you want to include additional fields in the batch, you can override this method and return your own “Batch” namedtuple: copy the Batch namedtuple at the top of this class, and then add whatever additional fields you want to be able to access. You can then call super().batchify(…) to set up the original fields, set up the additional fields in your subclass, and return that batch instead.

  • obs_batch – List of vectorized observations

  • sort – Default False, orders the observations by length of vectors. Set to true when using torch.nn.utils.rnn.pack_padded_sequence. Uses the text vectors if available, otherwise uses the label vectors if available.
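The filtering, sorting, and padding that batchify performs can be sketched in plain Python. This is an illustration only: it uses nested lists instead of tensors, returns a dict instead of a Batch, and the pad index of 0 and the longest-first sort order are assumptions.

```python
def batchify_sketch(obs_batch, sort=False, pad_idx=0):
    """Pad valid observations into a rectangular batch (illustrative)."""
    # Keep only observations that were vectorized, remembering their rows.
    valid = [(i, obs) for i, obs in enumerate(obs_batch) if "text_vec" in obs]
    if not valid:
        return {}
    if sort:
        # Longest first, the order pack_padded_sequence traditionally expects.
        valid.sort(key=lambda pair: len(pair[1]["text_vec"]), reverse=True)
    valid_indices = [i for i, _ in valid]
    vecs = [obs["text_vec"] for _, obs in valid]
    lengths = [len(v) for v in vecs]
    max_len = max(lengths)
    text_vec = [v + [pad_idx] * (max_len - len(v)) for v in vecs]
    return {
        "text_vec": text_vec,
        "text_lengths": lengths,
        "valid_indices": valid_indices,
    }


obs_batch = [{"text_vec": [4, 5]}, {}, {"text_vec": [6, 7, 8]}]
batch = batchify_sketch(obs_batch, sort=True)
print(batch)
```

The empty observation in row 1 is dropped, and valid_indices records that the first row of the batch came from original row 2.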

match_batch(batch_reply, valid_inds, output=None)

Match sub-batch of predictions to the original batch indices.

Batches may be only partially filled (i.e., when completing the remainder at the end of the validation or test set), or we may want to sort by, e.g., the length of the input sequences if using pack_padded_sequence.

This matches rows back with their original row in the batch for calculating metrics like accuracy.

If output is None (model choosing not to provide any predictions), we will just return the batch of replies.

Otherwise, output should be a parlai.core.torch_agent.Output object. This is a namedtuple, which can provide text predictions and/or text_candidates predictions. If you would like to map additional fields into the batch_reply, you can override this method as well as providing your own namedtuple with additional fields.

  • batch_reply – Full-batchsize list of message dictionaries to put responses into.

  • valid_inds – Original indices of the predictions.

  • output – Output namedtuple which contains sub-batchsize list of text outputs from model. May be None (default) if model chooses not to answer. This method will check for text and text_candidates fields.
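A minimal sketch of the index mapping follows. It assumes the predictions arrive as plain lists rather than an Output object, and the function name is hypothetical.

```python
def match_batch_sketch(batch_reply, valid_inds, text=None, text_candidates=None):
    """Write sub-batch predictions back into their original rows."""
    if text is not None:
        for i, response in zip(valid_inds, text):
            batch_reply[i]["text"] = response
    if text_candidates is not None:
        for i, cands in zip(valid_inds, text_candidates):
            batch_reply[i]["text_candidates"] = cands
    return batch_reply


# Three original rows; row 1 was invalid and the rest were sorted by length.
batch_reply = [{"id": "agent"} for _ in range(3)]
result = match_batch_sketch(
    batch_reply,
    valid_inds=[2, 0],
    text=["long answer", "short"],
)
print(result)
```

Rows that received no prediction keep their original message dictionary untouched.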


Retrieve the last reply from the model.

If available, we use the true label instead of the model’s prediction.

By default, batch_act stores the batch of replies and this method will extract the reply of the current instance from the batch.


use_label – default true, use the label when available instead of the model’s generated response.


Get the model’s predicted reply history within this episode.


batch – (default False) return the reply history for every row in the batch, otherwise will return just for this example.


list of lists of strings, each of the past model replies in the current episode. Will be None wherever the model did not reply.


Process incoming message in preparation for producing a response.

This includes remembering the past history of the conversation.


Get the state dict for saving

Override this method for more specific saving.


Save model parameters to path (or default to model_file arg).

Please try to refrain from overriding this function, and instead override state_dict(self) for more specific saving.


Load the state dict into model.

This is easily overridable to facilitate transfer of state dicts.


Return opt and model states.

Override this method for more specific loading.


Clear internal states.


Call batch_act with the singleton batch.


Process a batch of observations (batchsize list of message dicts).

These observations have been preprocessed by the observe method.

Subclasses can override this for special functionality, but if the default behaviors are fine then just override the train_step and eval_step methods instead. The former is called when labels are present in the observations batch; otherwise, the latter is called.


[Abstract] Process one batch with training labels.


[Abstract] Process one batch but do not train on it.


Perform a backward pass. It is recommended you use this instead of loss.backward(), for integration with distributed training and FP16 training.


Perform a step of optimization, clipping gradients and adjusting the LR schedule if needed. Gradient accumulation is also performed if the agent is called with --update-freq.

It is recommended (but not forced) that you call this in train_step.


Zero out optimizer.

It is recommended you call this in train_step. It automatically handles gradient accumulation if the agent is called with --update-freq.
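The three helpers described above (backward pass, optimization step, zeroing gradients) are usually called together inside train_step. The toy below sketches how gradient accumulation could interact with them, using a single scalar parameter and plain SGD. The class, its method names, and the exact accumulation logic are assumptions for illustration, not the ParlAI code.

```python
class AccumulatingTrainer:
    """Hypothetical toy: one scalar parameter trained with plain SGD."""

    def __init__(self, update_freq=1, lr=0.1):
        self.update_freq = update_freq
        self.lr = lr
        self.param = 0.0
        self.grad = 0.0
        self.pending = 0  # backward passes since the last optimizer step
        self.steps = 0    # number of real optimizer steps taken

    def zero_grad(self):
        if self.pending != 0:
            return  # mid-accumulation: keep the gradients gathered so far
        self.grad = 0.0

    def backward(self, grad):
        self.grad += grad  # gradients add up, as they do in PyTorch

    def update_params(self):
        self.pending = (self.pending + 1) % self.update_freq
        if self.pending != 0:
            return  # not enough batches accumulated yet
        self.param -= self.lr * self.grad
        self.steps += 1

    def train_step(self, grad):
        self.zero_grad()
        self.backward(grad)
        self.update_params()


trainer = AccumulatingTrainer(update_freq=2)
for g in [1.0, 3.0, 2.0, 2.0]:
    trainer.train_step(g)
print(trainer.steps, trainer.param)
```

Four batches with an update frequency of 2 produce two optimizer steps, each using the summed gradient of two batches.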

Torch Ranker Agent

class parlai.core.torch_ranker_agent.TorchRankerAgent(opt, shared=None)

Bases: parlai.core.torch_agent.TorchAgent

classmethod add_cmdline_args(argparser)

Add the default commandline args we expect most agents to want.

__init__(opt, shared=None)

Initialize agent.

score_candidates(batch, cand_vecs, cand_encs=None)

Given a batch and candidate set, return scores (for ranking).

  • batch (Batch) – a Batch object (defined in

  • cand_vecs (LongTensor) – padded and tokenized candidates

  • cand_encs (FloatTensor) – encoded candidates, if these are passed into the function (in cases where we cache the candidate encodings), you do not need to call self.model on cand_vecs
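As one possible instance of candidate scoring, the sketch below assumes contexts and candidates have already been encoded into fixed-size vectors and scores them with a dot product. The function names and the use of plain lists instead of tensors are assumptions; a real model is free to score candidates differently.

```python
def score_candidates_sketch(context_encs, cand_encs):
    """Dot-product scores: one row per context, one column per candidate."""
    return [
        [sum(c * k for c, k in zip(ctx, cand)) for cand in cand_encs]
        for ctx in context_encs
    ]


def rank_candidates(scores, candidates):
    """Order the candidate strings from best to worst for each row."""
    ranked = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([candidates[j] for j in order])
    return ranked


context_encs = [[1.0, 0.0], [0.0, 1.0]]
cand_encs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
candidates = ["a", "b", "c"]

scores = score_candidates_sketch(context_encs, cand_encs)
ranked = rank_candidates(scores, candidates)
print(ranked)
```

Caching cand_encs, as the parameter description suggests, avoids re-encoding a fixed candidate set on every batch.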


Build a new model (implemented by children classes)


Override from TorchAgent.


Train on a single batch of examples.


Evaluate a single batch of examples.


Share model parameters.


Reset metrics.


Report loss and mean_rank from model’s perspective.


Load the tokens from the vocab as candidates

self.vocab_candidates will contain a [num_cands] list of strings. self.vocab_candidate_vecs will contain a [num_cands, 1] LongTensor.


Load a set of fixed candidates and their vectors (or vectorize them here)

self.fixed_candidates will contain a [num_cands] list of strings. self.fixed_candidate_vecs will contain a [num_cands, seq_len] LongTensor.

See the note on the --fixed-candidate-vecs flag for an explanation of the ‘reuse’, ‘replace’, or path options.

Note: TorchRankerAgent by default converts candidates to vectors by vectorizing in the common sense (i.e., replacing each token with its index in the dictionary). If a child model wants to additionally perform encoding, it can override the vectorize_fixed_candidates() method to produce encoded vectors instead of just vectorized ones.


Convert a batch of candidates from text to vectors


cands_batch – a [batchsize] list of candidates (strings)


a [num_cands] list of candidate vectors

By default, candidates are simply vectorized (tokens replaced by token ids). A child class may choose to override this method to perform vectorization as well as encoding if so desired.

Torch Generator Agent

Generic PyTorch-based Generator agent. Implements quite a bit of boilerplate, including forced-decoding loss and beam search.

Contains the following utilities:

  • TorchGeneratorAgent class, which serves as a useful parent for generative torch agents.

  • Beam class which provides some generic beam functionality for classes to use

class parlai.core.torch_generator_agent.TorchGeneratorModel(padding_idx=0, start_idx=1, end_idx=2, unknown_idx=3, input_dropout=0, longest_label=1)

Bases: torch.nn.modules.module.Module

This interface expects you to implement a model with the following attributes:

Attribute model.encoder

takes input and returns a tuple (enc_out, enc_hidden, attn_mask)

Attribute model.decoder

takes decoder params and returns decoder outputs after attention

Attribute model.output

takes decoder outputs and returns a distribution over the dictionary

__init__(padding_idx=0, start_idx=1, end_idx=2, unknown_idx=3, input_dropout=0, longest_label=1)

Initialize self. See help(type(self)) for accurate signature.

decode_greedy(encoder_states, bsz, maxlen)

Greedy search

  • bsz (int) – Batch size. Because encoder_states is model-specific, it cannot infer this automatically.

  • encoder_states (Model specific) – Output of the encoder model.

  • maxlen (int) – Maximum decoding length


pair (logits, choices) of the greedy decode

Return type

(FloatTensor[bsz, maxlen, vocab], LongTensor[bsz, maxlen])
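The idea behind greedy search can be sketched without tensors: at each step, pick the highest-scoring next token and feed it back in. In this sketch next_token_scores is a hypothetical callable standing in for the decoder plus output layer, rows are decoded one at a time rather than as a batch, and only the chosen tokens are returned (not the logits).

```python
def decode_greedy_sketch(next_token_scores, bsz, maxlen, start_idx=1, end_idx=2):
    """Greedy decoding over a toy scoring function (illustrative only)."""
    choices = []
    for _ in range(bsz):
        prefix = [start_idx]
        out = []
        for _ in range(maxlen):
            scores = next_token_scores(prefix)
            # Pick the index of the highest score.
            tok = max(range(len(scores)), key=scores.__getitem__)
            out.append(tok)
            prefix.append(tok)
            if tok == end_idx:
                break
        choices.append(out)
    return choices


def toy_scores(prefix):
    # Toy model: prefer 4 after start, 5 after 4, and end (2) otherwise.
    transitions = {1: 4, 4: 5, 5: 2}
    scores = [0.0] * 6
    scores[transitions.get(prefix[-1], 2)] = 1.0
    return scores


full = decode_greedy_sketch(toy_scores, bsz=2, maxlen=5)
cut = decode_greedy_sketch(toy_scores, bsz=1, maxlen=2)
print(full, cut)
```

The second call shows the effect of maxlen: decoding stops after two tokens even though the end token has not been produced.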

decode_forced(encoder_states, ys)

Decode with a fixed, true sequence, computing loss. Useful for training, or ranking fixed candidates.

  • ys (LongTensor[bsz, time]) – the prediction targets. Contains both the start and end tokens.

  • encoder_states (model specific) – Output of the encoder. Model specific types.


pair (logits, choices) containing the logits and MLE predictions

Return type

(FloatTensor[bsz, ys, vocab], LongTensor[bsz, ys])

reorder_encoder_states(encoder_states, indices)

Reorder encoder states according to a new set of indices.

This is an abstract method, and must be implemented by the user.

Its purpose is to give beam search a model-agnostic way to manipulate encoder states. For example, this method is used to sort hypotheses, expand beams, etc.

For example, assume that encoder_states is a bsz x 1 tensor of values

indices = [0, 2, 2]
encoder_states = [[0.1]
                  [0.2]
                  [0.3]]

then the output will be

output = [[0.1]
          [0.3]
          [0.3]]

  • encoder_states (model specific) – output from encoder. type is model specific.

  • indices (list[int]) – the indices to select over. The user must support non-tensor inputs.


The re-ordered encoder states. It should be of the same type as encoder states, and it must be a valid input to the decoder.

Return type

model specific
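For a model whose encoder state is a single bsz x dim array, reordering amounts to selecting rows by index. The sketch below uses nested lists and a hypothetical function name; the values are illustrative. A tensor-based model would typically do the same with an index-select operation on each tensor in its encoder state.

```python
def reorder_encoder_states_sketch(encoder_states, indices):
    """Select rows of encoder_states in the order given by indices."""
    return [encoder_states[i] for i in indices]


encoder_states = [[0.1], [0.2], [0.3]]

# Row 0 is kept, row 1 is dropped, and row 2 is duplicated.
output = reorder_encoder_states_sketch(encoder_states, [0, 2, 2])

# Expanding a batch of 2 rows for a beam of size 2 repeats each row.
expanded = reorder_encoder_states_sketch([[0.5], [0.7]], [0, 0, 1, 1])
print(output, expanded)
```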

reorder_decoder_incremental_state(incremental_state, inds)

Reorder incremental state for the decoder.

Used to expand selected beams in beam_search. Unlike reorder_encoder_states, implementing this method is optional. However, without incremental decoding, decoding a single beam becomes O(n^2) instead of O(n), which can make beam search impractically slow.

In order to fall back to non-incremental decoding, just return None from this method.

  • incremental_state (model specific) – second output of model.decoder

  • inds (LongTensor[n]) – indices to select and reorder over.


The re-ordered decoder incremental states. It should be the same type as incremental_state, and usable as an input to the decoder. This method should return None if the model does not support incremental decoding.

Return type

model specific

forward(*xs, ys=None, cand_params=None, prev_enc=None, maxlen=None, bsz=None)

Get output predictions from the model.

  • xs (LongTensor[bsz, seqlen]) – input to the encoder

  • ys (LongTensor[bsz, outlen]) – Expected output from the decoder. Used for teacher forcing to calculate loss.

  • prev_enc – if you know you’ll pass in the same xs multiple times, you can pass in the encoder output from the last forward pass to skip recalculating the same encoder output.

  • maxlen – max number of tokens to decode. if not set, will use the length of the longest label this model has seen. ignored when ys is not None.

  • bsz – if ys is not provided, then you must specify the bsz for greedy decoding.


(scores, candidate_scores, encoder_states) tuple

  • scores contains the model’s predicted token scores. (FloatTensor[bsz, seqlen, num_features])

  • candidate_scores are the score the model assigned to each candidate. (FloatTensor[bsz, num_cands])

  • encoder_states are the output of model.encoder. Model specific types. Feed this back in to skip encoding on the next call.

class parlai.core.torch_generator_agent.TorchGeneratorAgent(opt, shared=None)

Bases: parlai.core.torch_agent.TorchAgent

Abstract Generator agent. Only meant to be extended.

TorchGeneratorAgent aims to handle much of the bookkeeping and infrastructure work for any generative models, like seq2seq or transformer. It implements the train_step and eval_step. The only requirement is that your model must implement the TorchGeneratorModel interface.

classmethod add_cmdline_args(argparser)

Add the default commandline args we expect most agents to want.

__init__(opt, shared=None)

Initialize agent.


Construct the model.

The model should be set to self.model, and support the TorchGeneratorModel interface.


Constructs the loss function. By default torch.nn.CrossEntropyLoss. The criterion function should be set to self.criterion.

If overridden, this model should (1) handle calling cuda and (2) produce a sum that can be used for a per-token loss.


Reset metrics for reporting loss and perplexity.


Share internal states between parent and child instances.


Report loss and perplexity from model’s perspective.

Note that this includes predicting __END__ and __UNK__ tokens and may differ from a truly independent measurement.

vectorize(*args, **kwargs)

Override vectorize for generative models.

compute_loss(batch, return_output=False)

Computes and returns the loss for the given batch. Easily overridable for customized loss functions.

If return_output is True, the full output from the call to self.model() is also returned, via a (loss, model_output) pair.
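To illustrate what a summed, per-token loss looks like, the sketch below computes cross-entropy over non-padding targets in plain Python and derives perplexity from it. The function names, the padding index of 0, and the nested-list inputs are assumptions; the agent itself operates on tensors with a criterion such as torch.nn.CrossEntropyLoss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def compute_loss_sketch(scores, targets, pad_idx=0):
    """Return (summed loss, number of target tokens).

    scores: bsz x seqlen x vocab nested lists; targets: bsz x seqlen.
    """
    total, n_tokens = 0.0, 0
    for row_scores, row_targets in zip(scores, targets):
        for logits, tgt in zip(row_scores, row_targets):
            if tgt == pad_idx:
                continue  # padding does not contribute to the loss
            total -= log_softmax(logits)[tgt]
            n_tokens += 1
    return total, n_tokens


# Uniform logits over a 4-word vocabulary; the last target is padding.
uniform = [0.0, 0.0, 0.0, 0.0]
scores = [[uniform, uniform], [uniform, uniform]]
targets = [[3, 2], [1, 0]]

loss_sum, n_tokens = compute_loss_sketch(scores, targets)
per_token_loss = loss_sum / n_tokens
perplexity = math.exp(per_token_loss)
print(n_tokens, per_token_loss, perplexity)
```

Returning the sum together with the token count lets the caller average over tokens rather than over batch rows, which keeps the loss comparable across batches of different lengths.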


Train on a single batch of examples.


Evaluate a single batch of examples.

Beam search given the model and Batch

This function expects to be given a TorchGeneratorModel. Please refer to that interface for information.

  • model (TorchGeneratorModel) – Implements the above interface

  • batch (Batch) – Batch structure with input and labels

  • beam_size (int) – Size of each beam during the search

  • start (int) – start of sequence token

  • end (int) – end of sequence token

  • pad (int) – padding token

  • min_length (int) – minimum length of the decoded sequence

  • min_n_best (int) – minimum number of completed hypotheses generated from each beam

  • max_ts (int) – the maximum length of the decoded sequence


tuple (beam_preds_scores, n_best_preds_scores, beams)

  • beam_preds_scores: list of (prediction, score) pairs for each sample in Batch

  • n_best_preds_scores: list of n-best lists of (prediction, score) tuples for each sample from Batch

  • beams: list of Beam instances, as defined in the Beam class; can be used for any subsequent postprocessing, e.g. dot logging.
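The search procedure itself can be sketched for a single example in plain Python. This is a generic textbook-style beam search under stated assumptions, not the ParlAI implementation: log_probs_fn is a hypothetical callable returning next-token log-probabilities for a prefix, there is no batching, and there is no length normalization or n-best bookkeeping.

```python
import math

NEG_INF = float("-inf")


def beam_search_sketch(log_probs_fn, beam_size, start, end, max_ts, min_length=0):
    """Return finished (tokens, score) hypotheses for one example, best first."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_ts):
        candidates = []
        for tokens, score in beams:
            for tok, lp in enumerate(log_probs_fn(tokens)):
                if lp == NEG_INF:
                    continue  # impossible continuation
                if tok == end and len(tokens) - 1 < min_length:
                    continue  # block the end token until min_length is reached
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    if not finished:
        finished = beams  # nothing reached the end token within max_ts
    return sorted(finished, key=lambda c: c[1], reverse=True)


def toy_log_probs(tokens):
    # Toy model over the vocabulary {0: pad, 1: start, 2: end, 3, 4}.
    table = {
        1: {3: 0.6, 4: 0.4},
        3: {2: 0.5, 3: 0.25, 4: 0.25},
        4: {2: 1.0},
    }
    probs = table.get(tokens[-1], {})
    return [math.log(probs[t]) if t in probs else NEG_INF for t in range(5)]


best = beam_search_sketch(toy_log_probs, beam_size=2, start=1, end=2, max_ts=5)
greedy = beam_search_sketch(toy_log_probs, beam_size=1, start=1, end=2, max_ts=5)
print(best[0], greedy[0])
```

In this toy model a beam of size 2 recovers the sequence with total probability 0.4, whereas a beam of size 1 (equivalent to greedy search) commits to the locally best first token and ends with probability 0.3.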

class parlai.core.torch_generator_agent.PerplexityEvaluatorAgent(opt, shared=None)

Bases: parlai.core.torch_generator_agent.TorchGeneratorAgent

Subclass for doing standardized perplexity evaluation.

This is designed to be used in conjunction with the PerplexityWorld at parlai/scripts/ It uses the next_word_probability function to calculate the probability of tokens one token at a time.

__init__(opt, shared=None)

Initialize evaluator.


Return probability distribution over next words.

This probability is based on both nn input and partial true output. This is used to calculate the per-word perplexity.

  • observation – input observation dict

  • partial_out – list of previous “true” words


a dict, where each key is a word and each value is a probability score for that word. Unset keys will use a probability of 1e-7.

e.g. {‘text’: ‘Run test program.’}, [‘hello’] => {‘world’: 1.0}
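A per-word perplexity can be derived from such a function by querying it once per true word and accumulating log-probabilities. The sketch below assumes a callable with the (observation, partial_out) arguments described above; the helper name and the toy model are hypothetical, and the 1e-7 default mirrors the value stated for unset keys.

```python
import math


def perplexity_sketch(next_word_probability, observation, label_words,
                      default=1e-7):
    """Per-word perplexity of label_words under a next-word model."""
    log_prob = 0.0
    for i, word in enumerate(label_words):
        # Condition on the input and on the true words seen so far.
        probs = next_word_probability(observation, label_words[:i])
        log_prob += math.log(probs.get(word, default))
    return math.exp(-log_prob / len(label_words))


def toy_model(observation, partial_out):
    # Hypothetical model: unsure about the first word, certain afterwards.
    if not partial_out:
        return {"hello": 0.5, "hi": 0.5}
    return {"world": 1.0}


ppl = perplexity_sketch(toy_model, {"text": "Run test program."},
                        ["hello", "world"])
print(ppl)
```

Words missing from the returned dict fall back to the small default probability, so a model that omits the true word is penalized heavily rather than causing a division by zero.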

class parlai.core.torch_generator_agent.Beam(beam_size, min_length=3, padding_token=0, bos_token=1, eos_token=2, min_n_best=3, cuda='cpu', block_ngram=0)

Bases: object

Generic beam class. It keeps information about beam_size hypotheses.

__init__(beam_size, min_length=3, padding_token=0, bos_token=1, eos_token=2, min_n_best=3, cuda='cpu', block_ngram=0)

Instantiate Beam object.

  • beam_size – number of hypotheses in the beam

  • min_length – minimum length of the predicted sequence

  • padding_token – Set to 0 as usual in ParlAI

  • bos_token – Set to 1 as usual in ParlAI

  • eos_token – Set to 2 as usual in ParlAI

  • min_n_best – Beam will not be done until at least this many finished hypotheses (with EOS) exist

  • cuda – What device to use for computations

static find_ngrams(input_list, n)

Get list of ngrams with context length n-1
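One compact way to list n-grams is to zip the list against shifted copies of itself. The sketch below shows that idiom, plus a hypothetical helper (would_repeat_ngram) illustrating how such a list could support n-gram blocking during beam search; neither function is claimed to match the library code.

```python
def find_ngrams_sketch(input_list, n):
    """Return all contiguous n-grams of input_list as tuples."""
    return list(zip(*[input_list[i:] for i in range(n)]))


def would_repeat_ngram(hypothesis, next_token, n):
    """True if appending next_token repeats an n-gram already present."""
    ngrams = find_ngrams_sketch(hypothesis + [next_token], n)
    return len(ngrams) > 1 and ngrams[-1] in ngrams[:-1]


bigrams = find_ngrams_sketch([1, 2, 3, 4], 2)
print(bigrams)
print(would_repeat_ngram([5, 6, 7, 5], 6, 2))
print(would_repeat_ngram([5, 6, 7, 5], 7, 2))
```

A beam that blocks repeated n-grams could use a check like this to assign a very low score to any token that would recreate an earlier n-gram.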


Get the output at the current step.


Get the backtrack at the current step.


Advance the beam one step.


Return whether beam search is complete.


Get single best hypothesis.


hypothesis sequence and the final score


Extract hypothesis ending with EOS at timestep with hyp_id.

  • timestep – timestep with range up to len(self.outputs)-1

  • hyp_id – id with range up to beam_size-1


hypothesis sequence

static get_pretty_hypothesis(list_of_hypotails)

Return prettier version of the hypotheses.


Return finished hypotheses in rescored order.


n_best – number of best hypotheses to return


list of hypotheses


Check if self.finished is empty and add hyptail in that case.

This will be a suboptimal hypothesis, since the model did not produce any EOS.

get_beam_dot(dictionary=None, n_best=None)

Create pydot graph representation of the beam.

  • outputs – self.outputs from the beam

  • dictionary – token-to-word dict used to save words in the tree nodes


pydot graph