core.dict

Contains code for parsing and building a dictionary from text.

parlai.core.dict.escape(s)

Replace potential special characters with escaped version.

For example, a newline character becomes the two-character sequence \n, and a tab character becomes \t.

Parameters

s (str) – string to escape

parlai.core.dict.unescape(s)

Revert escaped characters back to their special version.

For example, the two-character sequence \n becomes a newline character, and \t becomes a tab character.

Parameters

s (str) – string to unescape
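The two helpers above are meant to undo each other. A minimal sketch of that idea follows, assuming only newline and tab are handled; `escape_sketch` and `unescape_sketch` are hypothetical stand-ins written for illustration, not the library's code.

```python
def escape_sketch(s):
    # Replace real newline and tab characters with their two-character
    # backslash forms, so the string fits on a single line of a file.
    return s.replace('\n', '\\n').replace('\t', '\\t')


def unescape_sketch(s):
    # Reverse the mapping: turn the backslash forms back into real characters.
    return s.replace('\\n', '\n').replace('\\t', '\t')


original = 'hello\tworld\nbye'
escaped = escape_sketch(original)
restored = unescape_sketch(escaped)
```

With plain string replacement like this, a string that already contains a literal backslash followed by n would not survive the round trip unchanged, which is worth keeping in mind when tokens come from raw text.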

parlai.core.dict.find_ngrams(token_dict, text, n)

Break text into ngrams that appear in token_dict.

Parameters
  • token_dict (dict) – dictionary to check for ngrams

  • text (str) – text to look for ngrams in

  • n (int) – maximum size of ngrams
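To make the idea of merging tokens into known ngrams concrete, here is a simplified greedy longest-match sketch. It is not the library's algorithm: it assumes the text has already been split into a list of tokens and that ngrams are stored in the dictionary as space-joined strings. The name `merge_ngrams_sketch` is hypothetical.

```python
def merge_ngrams_sketch(token_dict, tokens, n):
    # Walk left to right; at each position try the longest window first.
    merged = []
    i = 0
    while i < len(tokens):
        for size in range(min(n, len(tokens) - i), 1, -1):
            candidate = ' '.join(tokens[i:i + size])
            if candidate in token_dict:
                merged.append(candidate)
                i += size
                break
        else:
            # No multi-token ngram starts here, so keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


vocab = {'new york': 5, 'new york city': 2, 'hello': 9}
words = 'hello from new york city'.split()
trigrams = merge_ngrams_sketch(vocab, words, 3)
bigrams = merge_ngrams_sketch(vocab, words, 2)
```

With n set to 3 the three-word entry is matched as a whole, while with n set to 2 only the two-word entry can match and the trailing word is kept on its own.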

class parlai.core.dict.DictionaryAgent(opt, shared=None)

Bases: parlai.core.agents.Agent

Builds and/or loads a dictionary.

The dictionary provides access to the frequency of each token, functions to translate sentences from tokens to their vectors (list of ints, each int is the index of a token in the dictionary) and back from vectors to tokenized text.

static add_cmdline_args(argparser)

Add commandline arguments related to the dictionary.

__init__(opt, shared=None)

Initialize DictionaryAgent.

copy_dict(dictionary)

Overwrite own state with any state in the other dictionary. This allows loading of the contents of another dictionary while keeping the current dictionary version.

spacy_span_tokenize(text)

Returns a tuple of (tokens, spans).

nltk_tokenize(text, building=False)

Uses nltk-trained PunktTokenizer for sentence tokenization and Treebank Word Tokenizer for tokenizing words within sentences.

static re_tokenize(text)

Find boundaries between word characters, newlines, and non-word non-whitespace tokens (r'[\w\n]+ | [^\w\s] | \n').

This splits along whitespace and punctuation and keeps the newline as a token in the returned list.
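The behavior described above can be sketched with `re.findall`. The pattern below is an assumption chosen to match the description (word runs, single punctuation marks, and newlines as separate tokens); the exact pattern the library compiles may differ, and `re_tokenize_sketch` is a hypothetical name.

```python
import re

# Assumed pattern for illustration: a run of word characters, a single
# non-word non-whitespace character, or a newline.
TOKEN_PATTERN = re.compile(r'\w+|[^\w\s]|\n')


def re_tokenize_sketch(text):
    # findall returns every non-overlapping match, in order of appearance.
    return TOKEN_PATTERN.findall(text)


tokens = re_tokenize_sketch('Hello, world!\nBye.')
```

Spaces are dropped because no alternative in the pattern matches them, while the newline survives as its own token.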

static split_tokenize(text)

Splits tokens based on whitespace after adding whitespace around punctuation.

Use re_tokenize if you want more robust handling of punctuation.
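A rough sketch of the pad-then-split approach is shown below. The set of punctuation marks is an assumption made for illustration, and `split_tokenize_sketch` is a hypothetical name rather than the library function.

```python
def split_tokenize_sketch(text, punctuation='.,;:!?'):
    # Pad each punctuation mark with spaces, then split on whitespace.
    for mark in punctuation:
        text = text.replace(mark, ' ' + mark + ' ')
    return text.split()


tokens = split_tokenize_sketch('Hi there, how are you?')
contraction = split_tokenize_sketch("don't stop.")
```

Marks outside the padded set, such as the apostrophe in the second call, stay attached to their word, which is one reason a regex-based tokenizer handles punctuation more robustly.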

span_tokenize(text)

Tokenizes, and then calculates the starting index of each token in the original string.
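One way to recover starting indices after tokenizing is to scan the original string forward, token by token. The sketch below assumes every token appears verbatim in the text; `span_starts_sketch` is a hypothetical helper, not the method's actual implementation.

```python
def span_starts_sketch(text, tokens):
    # Search from the end of the previous token so that repeated tokens
    # map to successive occurrences rather than always to the first one.
    starts = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError('token %r not found in text' % token)
        starts.append(start)
        cursor = start + len(token)
    return starts


text = 'the cat saw the dog'
starts = span_starts_sketch(text, text.split())
```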

tokenize(text, building=False)

Returns a sequence of tokens from the input text.

bpe_tokenize(text)

Return a sequence of BPE-tokens from the text.

add_to_dict(tokens)

Build dictionary from the list of provided tokens.

remove_tail(min_freq)

Remove elements below the frequency cutoff from the dictionary.

resize_to_max(maxtokens)

Trims the dictionary to the maximum number of tokens.
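The two trimming operations above can be illustrated on a plain token-to-count mapping. This is a sketch under assumptions: that the size cap keeps the most frequent tokens, and that ties are broken alphabetically. Both function names are hypothetical.

```python
def remove_tail_sketch(freq, min_freq):
    # Drop tokens whose count falls below the cutoff.
    return {tok: cnt for tok, cnt in freq.items() if cnt >= min_freq}


def resize_to_max_sketch(freq, maxtokens):
    # Assumed policy: keep the maxtokens most frequent tokens,
    # ordering equal counts alphabetically.
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:maxtokens])


freq = {'the': 10, 'cat': 3, 'sat': 3, 'zyzzyva': 1}
trimmed = remove_tail_sketch(freq, min_freq=2)
capped = resize_to_max_sketch(freq, maxtokens=2)
```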

load(filename)

Load pre-existing dictionary in ‘token[<TAB>count]’ format.

Counts are initialized from the loaded dictionary, or set to 0 where a count is not included.

save(filename=None, append=False, sort=True)

Save dictionary to file.

Format is ‘token<TAB>count’ for every token in the dictionary, sorted by count with the most frequent words first.

If append (default False) is set to True, appends instead of overwriting.

If sort (default True) is set, the dictionary is sorted before saving.
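The file format used by load and save can be sketched with two small helpers that work on lists of lines. These are hypothetical illustrations of the 'token<TAB>count' layout only; they do not reproduce the library's file handling.

```python
def dump_lines_sketch(freq):
    # One 'token<TAB>count' line per token, most frequent first.
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return ['%s\t%d' % (tok, cnt) for tok, cnt in ranked]


def parse_lines_sketch(lines):
    # The count field is optional; a missing count defaults to 0.
    freq = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        parts = line.split('\t')
        freq[parts[0]] = int(parts[1]) if len(parts) > 1 else 0
    return freq


lines = dump_lines_sketch({'cat': 3, 'the': 10})
loaded = parse_lines_sketch(lines + ['unseen'])
```

The extra line without a count shows the optional-count case: the token is still loaded, with a count of 0.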

sort(trim=True)

Sorts the dictionary, so that the elements with the lowest index have the highest counts. This reindexes the dictionary according to the sorted frequencies, breaking ties alphabetically by token.

Parameters

trim (bool) – If True, truncate the dictionary based on minfreq and maxtokens.
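The ordering rule described above (highest count first, ties broken alphabetically) maps directly onto a composite sort key. The sketch below ignores trimming and any special tokens, and the mapping names `tok2ind` and `ind2tok` are used here purely for illustration.

```python
def sort_sketch(freq):
    # Negating the count sorts it in descending order, while the token
    # itself sorts ascending, which breaks ties alphabetically.
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    tok2ind = {tok: idx for idx, (tok, _) in enumerate(ranked)}
    ind2tok = {idx: tok for tok, idx in tok2ind.items()}
    return tok2ind, ind2tok


tok2ind, ind2tok = sort_sketch({'sat': 3, 'the': 10, 'cat': 3})
```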

parse(txt_or_vec, vec_type=<class 'list'>)

Convenience function for parsing either text or vectors of indices.

Parameters

vec_type – type of the returned vector if the input is a string.

txt2vec(text, vec_type=<class 'list'>)

Converts a string to a vector (list of ints).

First runs a sentence tokenizer, then a word tokenizer.

vec_type is the type of the returned vector.

vec2txt(vector, delimiter=' ')

Converts a vector (iterable of ints) into a string, with each token separated by the delimiter (default ' ').
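The round trip between text and index vectors can be sketched with a toy class. `ToyDictionary` is a hypothetical stand-in, not the DictionaryAgent API: it tokenizes on whitespace only, and the unknown-token string and its index of 0 are assumptions made for the example.

```python
class ToyDictionary:
    """Hypothetical stand-in for illustration; not the DictionaryAgent API."""

    def __init__(self, tokens, unk='__unk__'):
        self.unk = unk
        self.tok2ind = {unk: 0}
        for tok in tokens:
            # Assign the next free index to each token not seen before.
            self.tok2ind.setdefault(tok, len(self.tok2ind))
        self.ind2tok = {idx: tok for tok, idx in self.tok2ind.items()}

    def txt2vec(self, text, vec_type=list):
        # Whitespace tokenization only; unknown tokens map to index 0.
        return vec_type(self.tok2ind.get(tok, 0) for tok in text.split())

    def vec2txt(self, vector, delimiter=' '):
        # Look up each index and join the tokens with the delimiter.
        return delimiter.join(self.ind2tok[idx] for idx in vector)


d = ToyDictionary('hello world'.split())
vec = d.txt2vec('hello there world')
text = d.vec2txt(vec)
as_tuple = d.txt2vec('world hello', vec_type=tuple)
```

Because the out-of-vocabulary word is replaced by the unknown token, converting back to text does not necessarily reproduce the input exactly.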

act()

Add words in the last observation to the dictionary.

This checks every message field listed in the --dict-textfields argument (e.g. “text,labels”).
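As a rough illustration of counting words from selected message fields, the sketch below treats a message as a plain dict and assumes a field holds either one string or a list of strings. `count_fields_sketch` is a hypothetical helper that tokenizes on whitespace; it is not how the agent itself is implemented.

```python
from collections import Counter


def count_fields_sketch(message, textfields='text,labels'):
    counts = Counter()
    for field in textfields.split(','):
        value = message.get(field)
        if value is None:
            continue
        # A field may hold one string or a list of strings.
        texts = [value] if isinstance(value, str) else value
        for text in texts:
            counts.update(text.split())
    return counts


message = {'text': 'hello world', 'labels': ['hello there'], 'episode_done': True}
counts = count_fields_sketch(message)
```

Fields that are not named in the comma-separated list, such as the boolean flag in this message, are ignored.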

share()

Share internal dicts.

shutdown()

Save on shutdown if save_path is set.