Monty Python in a Word Cloud

Have Some Fun Insights on
The Holy Grail Script by Monty Python

Posted by Maya Sandler on December 8, 2020

Intent

A word cloud is a technique for visualizing the frequent words in a text, where the size of each word represents its frequency in the text. In this project I build a word cloud and also talk about its advantages and disadvantages.

I had never tried creating a word cloud in Python and I was looking for a nice task to do. I searched for a text that would not be too serious, would be fun to work on and would give me some funny insights. I immediately chose the "Monty Python and The Holy Grail" script, as I adore this movie(!), and because I had some guesses about the most frequent words in the script and was curious to find out if they were correct.

I downloaded the script, imported the text into Python, cleaned and preprocessed the data, and finally analyzed and visualized it to produce the result I share later in this project.

In the last part of the project, I give some pointers to future projects I could do with this fun and thought-provoking method.

Description of the Data

The dataset contains the following:

The interesting text that is said by the characters.
Sound effects and music in the movie.
Information about the speaker of each line.
Blank lines.

Import the Dataset

The script was downloaded from this website and imported into Python. The script was made up of 23 scenes that were joined into a single file and read into a string.
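
A minimal sketch of the joining step, assuming the scenes are saved as plain-text files in one local folder. The folder layout, the `.txt` extension and the name `load_script` are assumptions for illustration, not details from the original project.

```python
from pathlib import Path


def load_script(folder):
    """Join every .txt file in `folder` into a single string.

    Files are read in sorted name order, so zero-padded names
    (scene_01.txt, scene_02.txt, ...) keep the scenes in sequence.
    """
    scene_files = sorted(Path(folder).glob("*.txt"))
    return "\n".join(path.read_text(encoding="utf-8") for path in scene_files)
```

Calling `load_script("scenes")` would return the whole script as one string, ready for the preprocessing steps.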

Preprocessing

Remove the Speakers From the Data

The text contained the speaker of each line in the script. This data had to be removed to avoid counting the frequencies of the speakers' names. I wanted to check the frequency of spoken words in the script, and keeping these names in the string would make them appear with a falsely high frequency (i.e. false positives).

This is an example of the original text:

Speakers were marked in the script by capital letters followed by a colon. To remove the speakers, the Python code looked for the first word(s) in each line up to the colon character and kept only the part of the line that came after the colon:
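
One way to sketch this step is with a regular expression. The pattern, the function name `remove_speakers` and the sample lines are assumptions for illustration; the pattern only treats a line as having a speaker when it starts with capital letters (optionally with digits, spaces or `#`) followed by a colon.

```python
import re

# A speaker label: capital letters (plus digits, spaces, # . ' -) up to the first colon.
SPEAKER = re.compile(r"^[A-Z][A-Z0-9 #.'-]*:\s*")


def remove_speakers(text):
    """Drop the leading 'SPEAKER:' label from every line, keeping the spoken part."""
    return "\n".join(SPEAKER.sub("", line) for line in text.splitlines())


# Illustrative lines written in the script's format
sample = "ARTHUR: Whoa there!\nSOLDIER #1: Halt!"
print(remove_speakers(sample))
```

Lines without a capitalized label are left untouched, so a colon in the middle of a spoken sentence does not cut the sentence.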

Remove Sound Effects and Music From the Data

The script also contained all the sound effects and background music in the movie. These had to be removed from the data, as they would otherwise lead to wrong analysis and conclusions. The sounds in the script were enclosed in square brackets, and a single line could include several [sound effects] elements. This unnecessary data was also removed from the string via code:
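
A minimal sketch of this step, assuming the brackets are never nested. The function name `remove_sound_effects` and the sample line are hypothetical.

```python
import re


def remove_sound_effects(text):
    """Replace every [bracketed] element with a space, even several per line."""
    return re.sub(r"\[[^\]]*\]", " ", text)


# Illustrative line with two bracketed elements
sample = "Whoa there! [clop clop clop] Halt! [wind]"
print(remove_sound_effects(sample).split())
```

Replacing with a space rather than an empty string keeps the words on either side of a bracket from being glued together.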

Remove Unwanted Characters

The text also included commas, periods, colons, apostrophes, exclamation marks, etc. that had to be removed in order to process only the words in the text:
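
A possible sketch using only the standard library. It assumes that apostrophes should be deleted (so a contraction stays one word) while all other ASCII punctuation becomes a space; that choice and the name `remove_punctuation` are assumptions, not details from the original code.

```python
import string

# All ASCII punctuation except the apostrophe becomes a space;
# the apostrophe itself is deleted.
OTHER_PUNCTUATION = string.punctuation.replace("'", "")
PUNCTUATION_TABLE = str.maketrans(OTHER_PUNCTUATION, " " * len(OTHER_PUNCTUATION), "'")


def remove_punctuation(text):
    """Strip ASCII punctuation so that only words and whitespace remain."""
    return text.translate(PUNCTUATION_TABLE)


print(remove_punctuation("It's only a model!").split())
```

Note that `string.punctuation` covers ASCII characters only, so curly quotes or other typographic marks would need to be added to the table separately.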

In a word cloud, we are interested in specific words that have significance in the text. Mainly we are looking for nouns, verbs and adjectives. I was also interested in interjections. It is Monty Python! However, most texts contain a high frequency of words that will not contribute to our understanding of trends in the text, like pronouns, prepositions and conjunctions. Therefore, these words needed to be removed from the word list.

The first step was to turn all the text into lowercase via the lower() method (to avoid removing the word "no" while keeping the word "No"). The second step was to break the text into a list of words via the split() method, which also drops all spaces and empty lines. The third step was to delete the uninteresting words, as described above, from the word list. The final step was to join everything into a single string in which words are separated by a single space, as this is the input for the wordcloud processing.

Interesting Vs Uninteresting Words, Change to Lowercase and Remove Spaces and Empty Lines
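
The four steps described above can be sketched as follows. The list of uninteresting words here is a short hypothetical sample, and the name `keep_interesting_words` is an assumption; the real list would be much longer.

```python
# Hypothetical sample of uninteresting words (pronouns, prepositions, conjunctions...)
UNINTERESTING = {"the", "a", "i", "you", "it", "and", "of", "to", "in", "is"}


def keep_interesting_words(text, uninteresting=UNINTERESTING):
    """Lowercase, split on any whitespace, drop uninteresting words, rejoin."""
    words = text.lower().split()  # split() also discards extra spaces and empty lines
    kept = [word for word in words if word not in uninteresting]
    return " ".join(kept)  # one string, words separated by single spaces


print(keep_interesting_words("No  no\n\nIt is the Holy Grail"))
```

Because the text is lowercased before filtering, "No" and "no" are treated as the same word.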

Change Words to Singular Form

Some words appear in their plural form and would not be counted together with the same word in its singular form. To fix this, I printed only the words that end with "s" and wrote code to go over the list and change them to their singular form.
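
A sketch of this two-part step, assuming the plural forms are mapped by hand after inspecting the printed list. The mapping entries and both function names are hypothetical examples.

```python
# Hypothetical hand-made mapping, built after inspecting the words that end with "s"
PLURAL_TO_SINGULAR = {
    "swallows": "swallow",
    "coconuts": "coconut",
    "knights": "knight",
}


def words_ending_in_s(words):
    """List the distinct words that end with 's', for manual review."""
    return sorted({word for word in words if word.endswith("s")})


def to_singular(words, mapping=PLURAL_TO_SINGULAR):
    """Replace each mapped plural with its singular form; leave other words as they are."""
    return [mapping.get(word, word) for word in words]


words = ["swallows", "yes", "coconut", "swallows"]
print(words_ending_in_s(words))
print(to_singular(words))
```

An explicit mapping avoids damaging words such as "yes" that end with "s" without being plural, at the cost of maintaining the list by hand.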

Super! Now we have only single interesting words in a string format that we can analyze to see interesting things.

Creating the Word Cloud

Python Libraries Needed for the Task

I looked for a non-square output, because I have done a square one and it is definitely less fun and less meaningful. I looked into creating a word cloud inside of an image and found that this function exists on GitHub, but everyone is using something else. Eventually I found this way to have the shortest code, to be the easiest to execute and to give great results:

I needed several libraries:

Numpy and PIL were used for the background image, called the "mask".
Wordcloud was used to analyze the frequency of unique words in a string.
Matplotlib.pyplot was used to plot the resulting image.

Create a Mask from an Image of my Choice

The PIL library was used to open the image of my choice, and the mask was then created as a NumPy array from that image. By default, this code creates a black mask on a white background:
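
A minimal sketch of the mask step, assuming the image is a dark shape on a white background saved as a local file. The function name `load_mask` and the file name in the usage note are hypothetical.

```python
import numpy as np
from PIL import Image


def load_mask(image_path):
    """Open an image file and return its pixels as a NumPy array."""
    with Image.open(image_path) as image:
        return np.array(image)
```

For example, `mask = load_mask("bunny.png")` would give an array with one row per pixel row of the image, which can then be passed to WordCloud as the mask.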

Create the Word Cloud

WordCloud was used to set all the parameters of the final representation, like resolution (width and height), background_color, contour_width and contour_color. These parameters were fun to play with, but I finally decided to go for white on white, so that the words pop out and create the image themselves, rather than the contour of the image.

WordCloud was also used to set the max_words parameter for the analysis. The number in this parameter must be less than the number of unique words, otherwise words will repeat themselves in the representation. Thus I added a counter for the unique words, in addition to counters of the lines and words in the original file. The text has 439 unique words; therefore, I set max_words to 400 (losing some words that appeared only once in the text). To see only the most frequent words, I can change it to 20-25 words.
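
The counters can be sketched in plain Python. Both function names are hypothetical, and `pick_max_words` simply encodes the rule above that max_words should stay below the number of unique words.

```python
def count_text(raw_text, cleaned_words):
    """Count lines and words in the raw text, and unique words after cleaning."""
    return {
        "lines": len(raw_text.splitlines()),
        "words": len(raw_text.split()),
        "unique_words": len(set(cleaned_words)),
    }


def pick_max_words(unique_words, requested=400):
    """Return the requested max_words, capped just below the unique word count."""
    return min(requested, max(unique_words - 1, 1))


stats = count_text("Lancelot! Lancelot!\nOh", ["lancelot", "lancelot", "oh"])
print(stats)
print(pick_max_words(439))
```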

* Important note: The final code doesn't include the generate() method - see next section (verify the result).

OK, now to plot the image and save it to a file:

Verify the Result - Fix A Feature That Created a Bug in my Project

It turns out that WordCloud's generate() has a feature that detects repeating pairs of words. It decided that "Lancelot! Lancelot! Lancelot!" should become the word pair "Lancelot Lancelot", which led to a mistake in the analysis.

To verify this, I printed the list of single words to be analyzed by WordCloud and compared it to the output words after the generate() method.
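
The comparison can be sketched in plain Python. After generate(), a WordCloud object exposes the words it kept in its `words_` attribute, a dictionary keyed by word. The helper name `unexpected_entries` and the sample values below are hypothetical.

```python
def unexpected_entries(input_words, cloud_words):
    """Return the entries in the cloud's output that never appeared as single input words."""
    known = set(input_words)
    return sorted(word for word in cloud_words if word not in known)


# Hypothetical values standing in for the cleaned word list and for cloud.words_
input_words = ["lancelot", "lancelot", "lancelot", "question"]
cloud_words = {"lancelot lancelot": 1.0, "question": 0.5}
print(unexpected_entries(input_words, cloud_words))
```

Any entry this helper returns, such as a two-word pair, was created by the library rather than taken from the input list.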

To overcome this bug in the analysis, I used a different WordCloud method called generate_from_frequencies() (instead of the generate() method), which requires a dictionary with the unique words as keys and the number of times they appear in the text as the values:
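
Such a dictionary can be built with the standard library, for example with `collections.Counter`. The function name `word_frequencies` is hypothetical, and the input is assumed to be the cleaned string of space-separated words.

```python
from collections import Counter


def word_frequencies(cleaned_text):
    """Map each unique word to the number of times it appears."""
    return dict(Counter(cleaned_text.split()))


frequencies = word_frequencies("lancelot lancelot lancelot question")
print(frequencies)
```

The resulting dictionary is what generate_from_frequencies() takes as its input, so every word is counted on its own and no pairs are formed.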

Create a Dictionary With Frequent Words

Many of the unique words (after the reduction of unwanted words) in the script showed up only once in the text and can create noise in the word cloud. Therefore, a new dictionary was created that included only frequent words, meaning words that appeared in the text more than once, using this code:
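
A minimal sketch of the filter, assuming the frequencies are already in a word-to-count dictionary. The function name `frequent_only` and the `min_count` parameter are hypothetical.

```python
def frequent_only(frequencies, min_count=2):
    """Keep only the words that appear at least `min_count` times."""
    return {word: count for word, count in frequencies.items() if count >= min_count}


print(frequent_only({"question": 17, "lancelot": 12, "shrubbery": 1}))
```

Raising `min_count` is one way to tune how much of the long tail of rare words reaches the word cloud.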

Results:

Frequency of (Almost) All Words in the Script

The script ran over 1466 words and 467 lines from 23 text files (the number of scenes in the movie), and pulled out and analyzed 400 unique words. The size of each word in the image below is proportional to its frequency in the script. This, however, gives too much noise in my opinion, as it includes words that appeared only once in the script:

Could you tell which mask image I chose for this analysis? It is the killer bunny from the movie! Ha!

The Frequent Words in The Holy Grail movie?

To find the most frequent words in the movie and to reduce the noise in the result, I added a restriction that shows only words that are frequent (i.e. appear in the text more than once). This script counted 123 unique words that appear in the text more than once and represented them in the image below. In my opinion, these are too few data points, as we can't see the difference between frequent and non-frequent words:

A Table of The Words and Their Frequency:

The traditional way is to print a table of the words and their frequency of appearance in the movie in descending order.

The most frequent words are: Question (appearing 17 times in the movie), Lancelot (appearing 12 times), Swallow (appearing 11 times), Oh (appearing 11 times), Come (appearing 11 times) and Ha (appearing 11 times):
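
A sketch of how such a table could be printed from a frequency dictionary. The counts are the six listed above; the function names and column widths are assumptions, and ties are ordered alphabetically here only to make the output stable.

```python
def frequency_rows(frequencies):
    """Sort (word, count) pairs by count, highest first; ties in alphabetical order."""
    return sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))


def print_frequency_table(frequencies):
    """Print a two-column table of words and their counts."""
    print(f"{'Word':<12}{'Count':>5}")
    for word, count in frequency_rows(frequencies):
        print(f"{word:<12}{count:>5}")


top_words = {"question": 17, "lancelot": 12, "swallow": 11, "oh": 11, "come": 11, "ha": 11}
print_frequency_table(top_words)
```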

Conclusions

About this project:

In this project we analyzed all the words in the movie "The Holy Grail" by Monty Python. This analysis can produce a table / chart that shows the words from most to least frequent. However, this result doesn't give us a way to really grasp the words in their full meaning and the creators' intentions.

Interestingly, in this movie there is a high frequency of interjections and of elements used to make noises (coconut, oh, ha, shh, etc.), in addition to the regular names of places and people you see in other movies. The high frequency of these words is meant to give the audience a feeling of enjoyment and amusement. It works :)

Conclusions regarding Word cloud

Word cloud has many advantages:

Because reading is an automatic action, a word cloud is a great way to grasp multiple words and notions in the blink of an eye.
After cleaning the data and preparing it for processing, the procedure of turning the data into a word cloud is easy and quick.
The result is visually appealing and invites the viewer to draw insights and impressions from it.
It can easily be used to analyze data from websites, newspapers, user feedback, online searches, and many more.

Nonetheless, word cloud also has weaknesses:

Plural vs singular words - a long list of plural words must be added individually for conversion, in order not to miss words and trends during analysis.
Difficulty highlighting topics of words (face-nose-eyes) and families of words (me-mine-I). In this project's example, I needed to know words and their meaning in the movie (like interjections and the coconut) in order to classify the words into something meaningful.
Need to decide whether to analyze single words (oh), pairs (oh-my) or triplets (my oh my). This changes the results and of course the code behind it.
Lack of context. Many times words have multiple meanings, depending on the context they are said in. Word cloud analysis dismantles the sentence into single words without context, which may carry significant information or change the meaning of the word (like the word "red" in "I love this. It is so rad". They don't mean the color).
In my opinion, using all the unique words in a text is too noisy, and using only the highly frequent unique words may give too little information. One must find the middle way: reducing the noise by limiting the visualized data to a bit more than the high-frequency words, but not so much that it disturbs the differentiation between the non-frequent and the frequent words.

We must take into consideration what we want to achieve and if a word cloud is the right way. It is fun and visually appealing, I'll give it that.