How to generate n-grams in Python without using any external libraries

Many text analysis applications use n-grams as a basis for building prediction models. An n-gram is a sequence of n words that appear consecutively in a text: a 1-gram is a single word, a 2-gram is a pair of adjacent words, and so on.

In this post, I document the Python code that I typically use to generate n-grams without depending on external Python libraries.

Steps to generate n-grams from a large string of text

I usually break up the task of generating n-grams from a large string of text into the following subtasks:

  1. Preprocess a large string of text and break it into a list of words.
  2. Generate n-grams from a list of words.

Code to preprocess a large string of text and break it into a list of words

I typically use the following function to preprocess the text before the generation of n-grams:

def process_text(text):

    text = text.lower()
    text = text.replace(',', ' ')
    text = text.replace('/', ' ')
    text = text.replace('(', ' ')
    text = text.replace(')', ' ')
    text = text.replace('.', ' ')

    # Convert text string to a list of words
    return text.split()

The process_text function accepts a single input parameter: the text that we want to preprocess.

It first converts all the characters in the text to lowercase. After that, it replaces commas, forward slashes, brackets and full stops with single spaces. Finally, it calls the split function on the text, which splits on runs of whitespace, and returns the resulting list of words.
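To illustrate the three steps just described, the same replacements can also be written with a translation table. This is only a sketch and not the function used in this post; the names process_text_translate and PUNCTUATION_TO_SPACE are made up for the example.

```python
# Hypothetical variant of process_text that uses a translation table
# instead of chained replace calls. The names are illustrative only.
PUNCTUATION_TO_SPACE = str.maketrans({ch: ' ' for ch in ',/().'})


def process_text_translate(text):
    # Lowercase, swap each listed punctuation character for a space,
    # then split on runs of whitespace.
    return text.lower().translate(PUNCTUATION_TO_SPACE).split()


print(process_text_translate('Red/Green (blue), and YELLOW.'))
# ['red', 'green', 'blue', 'and', 'yellow']
```

Because split is called without arguments, the double spaces left behind by the replacements do not produce empty strings in the result.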

I add more character replacements depending on where I anticipate the text input will come from. For example, if I expect the text to come from a web crawler, I also perform HTML decoding on the text input.
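As a rough sketch of the HTML decoding step, the standard library's html.unescape function can be applied before the other replacements. The wrapper name process_crawled_text is an assumption for this example. Note that html.unescape only decodes entities such as &amp;amp; and does not strip HTML tags.

```python
import html


def process_crawled_text(text):
    # Hypothetical wrapper: decode HTML entities such as &amp; and &quot;
    # before applying the same cleanup that process_text performs.
    text = html.unescape(text)
    text = text.lower()
    for ch in ',/().':
        text = text.replace(ch, ' ')
    return text.split()


print(process_crawled_text('Fish &amp; chips, please.'))
# ['fish', '&', 'chips', 'please']
```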

Code to generate n-grams from a list of words

I typically use the following function to generate n-grams out of a list of individual words:

def generate_ngrams(words_list, n):
    ngrams_list = []

    for num in range(0, len(words_list)):
        # Join up to n words starting at position num into one n-gram
        ngram = ' '.join(words_list[num:num + n])
        ngrams_list.append(ngram)

    return ngrams_list

The generate_ngrams function accepts two input parameters:

  1. A list of individual words which can come from the output of the process_text function.
  2. A number, n, which indicates the number of words in each n-gram.

Upon receiving the input parameters, the generate_ngrams function declares a list to keep track of the generated n-grams. It then loops through every position in words_list, joins up to n words starting at that position into an n-gram, and appends the result to ngrams_list.

When the loop completes, the generate_ngrams function returns ngrams_list to the caller.
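Because the loop starts a slice at every position, slices near the end of the list contain fewer than n words. If only full-length n-grams are wanted, one possible variant stops the loop early. This is a sketch under that assumption, and the name generate_full_ngrams is made up for the example.

```python
def generate_full_ngrams(words_list, n):
    # Hypothetical variant: stop at the last position where n words remain,
    # so every returned n-gram has exactly n words.
    return [' '.join(words_list[i:i + n])
            for i in range(len(words_list) - n + 1)]


words = ['the', 'lazy', 'dog']
print(generate_full_ngrams(words, 2))  # ['the lazy', 'lazy dog']
print(generate_full_ngrams(words, 4))  # []
```

When n is larger than the number of words, the range is empty and the function returns an empty list.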

Putting the process_text and generate_ngrams functions together to generate n-grams

The following is an example of how I would use the process_text and generate_ngrams functions in tandem to generate n-grams:

if __name__ == '__main__':

    text = 'A quick brown fox jumps over the lazy dog.'

    words_list = process_text(text)
    unigrams = generate_ngrams(words_list, 1)
    bigrams = generate_ngrams(words_list, 2)
    trigrams = generate_ngrams(words_list, 3)
    fourgrams = generate_ngrams(words_list, 4)
    fivegrams = generate_ngrams(words_list, 5)

    print(unigrams + bigrams + trigrams + fourgrams + fivegrams)

The script first assigns the string 'A quick brown fox jumps over the lazy dog.' to text. It then converts the text to a list of individual words with the process_text function. Once process_text completes, it uses the generate_ngrams function to create 1-gram, 2-gram, 3-gram, 4-gram and 5-gram sequences. Lastly, it prints the generated n-gram sequences to standard output.

Putting the code together in a Python script and running it gives me the following output. Because generate_ngrams starts a slice at every word position, the last few entries for each n greater than 1 contain fewer than n words (for example, the trailing 'lazy dog' and 'dog' among the trigrams):

['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'a quick', 'quick brown', 'brown fox', 'fox jumps', 'jumps over', 'over the', 'the lazy', 'lazy dog', 'dog', 'a quick brown', 'quick brown fox', 'brown fox jumps', 'fox jumps over', 'jumps over the', 'over the lazy', 'the lazy dog', 'lazy dog', 'dog', 'a quick brown fox', 'quick brown fox jumps', 'brown fox jumps over', 'fox jumps over the', 'jumps over the lazy', 'over the lazy dog', 'the lazy dog', 'lazy dog', 'dog', 'a quick brown fox jumps', 'quick brown fox jumps over', 'brown fox jumps over the', 'fox jumps over the lazy', 'jumps over the lazy dog', 'over the lazy dog', 'the lazy dog', 'lazy dog', 'dog']
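The five separate calls in the script can also be collapsed into a loop over n. The sketch below uses compact restatements of process_text and generate_ngrams so that it runs on its own; the names max_n and ngrams_by_n are assumptions for this example.

```python
def process_text(text):
    # Compact restatement of the preprocessing function from this post.
    text = text.lower()
    for ch in ',/().':
        text = text.replace(ch, ' ')
    return text.split()


def generate_ngrams(words_list, n):
    # Same behaviour as the loop version: one slice per word position.
    return [' '.join(words_list[i:i + n]) for i in range(len(words_list))]


words_list = process_text('A quick brown fox.')

# Map each n to its list of n-grams; max_n is an illustrative name.
max_n = 3
ngrams_by_n = {n: generate_ngrams(words_list, n) for n in range(1, max_n + 1)}

for n, ngrams in ngrams_by_n.items():
    print(n, ngrams)
```

Keeping the n-grams grouped by n in a dictionary makes it easy to pick out, say, only the bigrams later on.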

About Clivant

Clivant, a.k.a. Chai Heng, enjoys composing software and building systems to serve people. He owns and hopes that whatever he has written and built so far has benefited people.