{"id":584,"date":"2017-01-27T00:30:40","date_gmt":"2017-01-26T16:30:40","guid":{"rendered":"https:\/\/www.techcoil.com\/blog\/?p=584"},"modified":"2018-09-05T00:22:10","modified_gmt":"2018-09-04T16:22:10","slug":"how-to-generate-n-grams-in-python-without-using-any-external-libraries","status":"publish","type":"post","link":"https:\/\/www.techcoil.com\/blog\/how-to-generate-n-grams-in-python-without-using-any-external-libraries\/","title":{"rendered":"How to generate n-grams in Python without using any external libraries"},"content":{"rendered":"<p>Many text analysis applications use n-grams as a basis for building prediction models. The term \"n-grams\" refers to sequences of n words that appear consecutively in a text document; a 1-gram is a single word, a 2-gram is a pair of adjacent words, and so on.<\/p>\n<p>In this post, I document the Python code that I typically use to generate n-grams without depending on external Python libraries.<\/p>\n<h2>Steps to generate n-grams from a large string of text<\/h2>\n<p>I usually break up the task of generating n-grams from a large string of text into the following subtasks:<\/p>\n<ol>\n<li>Preprocess a large string of text and break it into a list of words.<\/li>\n<li>Generate n-grams from a list of words.<\/li>\n<\/ol>\n<h3>Code to preprocess a large string of text and break it into a list of words<\/h3>\n<p>I typically use the following function to preprocess the text before the generation of n-grams:<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\ndef process_text(text):\r\n\r\n    # Lowercase the text so that 'The' and 'the' yield the same word\r\n    text = text.lower()\r\n\r\n    # Replace the punctuation that we expect with spaces\r\n    text = text.replace(',', ' ')\r\n    text = text.replace('\/', ' ')\r\n    text = text.replace('(', ' ')\r\n    text = text.replace(')', ' ')\r\n    text = text.replace('.', ' ')\r\n\r\n    # Convert text string to a list of words\r\n    return text.split()\r\n\r\n<\/pre>\n<p>The <code>process_text<\/code> function takes the text that we want to preprocess as its only parameter.<\/p>\n<p>It first converts all the characters 
in the text to lowercase. After that, it replaces commas, forward slashes, brackets and full stops with single spaces. Finally, it calls <code>split<\/code> with no arguments, which splits the text on runs of whitespace, and returns the resulting list of words.<\/p>\n<p>I add more character replacements depending on where I expect the text input to come from. For example, if the text comes from a web crawler, I <a href=\"\/tools\/decode-html-entities-to-html-codes\" target=\"_blank\">perform HTML decoding<\/a> on the text input as well.<\/p>\n<h3>Code to generate n-grams from a list of words<\/h3>\n<p>I typically use the following function to generate n-grams out of a list of individual words:<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\ndef generate_ngrams(words_list, n):\r\n    ngrams_list = &#x5B;]\r\n\r\n    # Only start at positions that have a full n words available,\r\n    # so that every n-gram contains exactly n words\r\n    for num in range(len(words_list) - n + 1):\r\n        ngram = ' '.join(words_list&#x5B;num:num + n])\r\n        ngrams_list.append(ngram)\r\n\r\n    return ngrams_list\r\n\r\n<\/pre>\n<p>The <code>generate_ngrams<\/code> function accepts two input parameters:<\/p>\n<ol>\n<li>A list of individual words, which can come from the output of the <code>process_text<\/code> function.<\/li>\n<li>A number, <code>n<\/code>, which is the number of words in each n-gram.<\/li>\n<\/ol>\n<p>Upon receiving the input parameters, the <code>generate_ngrams<\/code> function declares a list to keep track of the generated n-grams. 
It then loops through every starting position in <code>words_list<\/code> that leaves room for n words, joins each slice of n words with single spaces, and appends the resulting n-gram to <code>ngrams_list<\/code>. If <code>words_list<\/code> contains fewer than n words, the loop does not run and the list stays empty.<\/p>\n<p>When the loop completes, the <code>generate_ngrams<\/code> function returns <code>ngrams_list<\/code> to the caller.<\/p>\n<h2>Putting the process_text and generate_ngrams functions together to generate n-grams<\/h2>\n<p>The following is an example of how I would use the <code>process_text<\/code> and <code>generate_ngrams<\/code> functions in tandem to generate n-grams:<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nif __name__ == '__main__':\r\n\r\n    text = 'A quick brown fox jumps over the lazy dog.'\r\n\r\n    words_list = process_text(text)\r\n    unigrams = generate_ngrams(words_list, 1)\r\n    bigrams = generate_ngrams(words_list, 2)\r\n    trigrams = generate_ngrams(words_list, 3)\r\n    fourgrams = generate_ngrams(words_list, 4)\r\n    fivegrams = generate_ngrams(words_list, 5)\r\n\r\n    print(unigrams + bigrams + trigrams + fourgrams + fivegrams)\r\n<\/pre>\n<p>The script first assigns the string 'A quick brown fox jumps over the lazy dog.' to <code>text<\/code>. It then converts the text to a list of individual words with the <code>process_text<\/code> function. Once <code>process_text<\/code> completes, it uses the <code>generate_ngrams<\/code> function to create 1-gram, 2-gram, 3-gram, 4-gram and 5-gram sequences. 
Lastly, it prints the generated n-gram sequences to standard output.<\/p>\n<p>Putting the code together in a Python script and running it gives the following output:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n&#x5B;'a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'a quick', 'quick brown', 'brown fox', 'fox jumps', 'jumps over', 'over the', 'the lazy', 'lazy dog', 'a quick brown', 'quick brown fox', 'brown fox jumps', 'fox jumps over', 'jumps over the', 'over the lazy', 'the lazy dog', 'a quick brown fox', 'quick brown fox jumps', 'brown fox jumps over', 'fox jumps over the', 'jumps over the lazy', 'over the lazy dog', 'a quick brown fox jumps', 'quick brown fox jumps over', 'brown fox jumps over the', 'fox jumps over the lazy', 'jumps over the lazy dog']\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Many text analysis applications use n-grams as a basis for building prediction models. The term &#8220;n-grams&#8221; refers to sequences of n words that appear consecutively in a text document.<\/p>\n<p>In this post, I document the Python code that I typically use to generate n-grams without depending on external Python libraries.<\/p>\n","protected":false},"author":1,"featured_media":1244,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":true,"_jetpack_newsletter_tier_id":0,"footnotes":""},"categories":[375],"tags":[370,373,226,372],"jetpack_featured_media_url":"https:\/\/www.techcoil.com\/blog\/wp-content\/uploads\/Python-Logo.gif","jetpack_shortlink":"https:\/\/wp.me\/p245TQ-9q","jetpack-related-posts":[],"jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.techcoil.com\/blog\/wp-json\/wp\/v2\/posts\/584"}],"collection":[{"href":"https:\/\/www.techcoil.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.techcoil.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\
/www.techcoil.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.techcoil.com\/blog\/wp-json\/wp\/v2\/comments?post=584"}],"version-history":[{"count":0,"href":"https:\/\/www.techcoil.com\/blog\/wp-json\/wp\/v2\/posts\/584\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.techcoil.com\/blog\/wp-json\/wp\/v2\/media\/1244"}],"wp:attachment":[{"href":"https:\/\/www.techcoil.com\/blog\/wp-json\/wp\/v2\/media?parent=584"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.techcoil.com\/blog\/wp-json\/wp\/v2\/categories?post=584"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.techcoil.com\/blog\/wp-json\/wp\/v2\/tags?post=584"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}