Abstract
Sentiment analysis helps us understand the emotions behind text, such as whether people feel positive, negative, or neutral about a topic. It is useful for analyzing opinions on social media, reviews, or other text data. In this project, you will gather text data on a topic of your choice and use a sentiment analysis tool called VADER (Valence Aware Dictionary and sEntiment Reasoner).
Summary
- Prerequisites: None
- Material availability: Readily available
- Safety: No issues
Objective
Collect text data from social media posts, news articles, books, etc., and explore how well the VADER sentiment analysis tool can identify positive and negative sentiments in human text.
Introduction
Language is a powerful tool we use to express our thoughts and feelings. But how do we know if someone is happy or upset? Is it in the way they talk or the words they choose? Sometimes it is clear – someone might say, “I’m so happy!” with excitement, or “I’m mad” while storming off. Other times, it is less obvious. For example, someone might say, “I’m fine,” but their tone and body language suggest they’re not.
Therapists and behavior experts spend years learning to understand the hidden emotions in people's words and actions. But they can only focus on one person at a time. What if we need to understand how millions of people feel about something?
For instance, how can companies analyze thousands of reviews to see if customers like their product? How can celebrities track public opinion across social media comments? How can politicians see whether news coverage about them is mostly positive or negative?
This is where sentiment analysis comes into play. Sentiment analysis is a technique in Natural Language Processing (NLP), a branch of Artificial Intelligence (AI) that teaches computers to understand human language. With NLP, computers can process large amounts of text to figure out if the overall mood is positive, negative, or neutral.
For a more detailed explanation of NLP, see the videos listed in the Bibliography.
One tool for sentiment analysis is VADER (Valence Aware Dictionary and sEntiment Reasoner). In this project, you will collect text data from sources like social media, news articles, reviews, or books. Then, you will use VADER to analyze the sentiments in these texts and compare the results. This will show how sentiment analysis can help us understand emotions and opinions on a larger scale.
Terms and Concepts
- Natural Language Processing (NLP)
- Artificial Intelligence (AI)
- Sentiment analysis
- VADER (Valence Aware Dictionary and sEntiment Reasoner)
- Tokenization
- Part-of-Speech (POS) tagging
Questions
- Why might it sometimes be difficult to tell how someone feels based on their words alone?
- What is Natural Language Processing (NLP), and why is it important?
- What is sentiment analysis, and how can it be useful?
- How might a business use sentiment analysis to improve its products or services?
- What challenges do you think might arise when analyzing text from different cultures or languages?
Bibliography
To learn more about NLP:
- CrashCourse. (2019, September). Natural Language Processing: Crash Course AI #7. YouTube. Retrieved December 12, 2024.
- IBM Technology. (2021, August). What is NLP (Natural Language Processing)?. YouTube. Retrieved December 12, 2024.
- Simplilearn. (2021, May). Natural Language Processing In 5 Minutes | What is NLP and How Does It Work?. YouTube. Retrieved December 12, 2024.
VADER (Valence Aware Dictionary and sEntiment Reasoner) source code:
- cjhutto. (2022, April). VADER-Sentiment-Analysis. GitHub. Retrieved December 12, 2024.
RoBERTa source code:
- FacebookAI. (n.d.). roberta-large-mnli. Hugging Face. Retrieved December 12, 2024.
Hugging Face Transformers NLP source code:
- huggingface. (n.d.). Transformers. GitHub. Retrieved December 12, 2024.
Materials and Equipment
- Computer with Internet access
Experimental Procedure

Overview
This project introduces you to sentiment analysis using VADER (Valence Aware Dictionary and sEntiment Reasoner) via the NLTK (Natural Language Toolkit) library. You will gather text data from at least two sources, score each text with VADER, and compare the sentiment patterns across your categories.
1. Setting Up the Project Environment
- You will need a Google account. If you do not have one, make one when prompted.
- Navigate to Google Drive, click on ‘MyDrive,’ then create a new folder and rename it “sentiment_analysis.”
- Download the sentiment_analysis.ipynb file from Science Buddies. This is the code you will need to process your data. Upload this file to your “sentiment_analysis” folder in MyDrive.
- Double-click on the sentiment_analysis.ipynb file. This should automatically open in Google Colab.
- Read the Troubleshooting Tips and How to Use This Notebook sections, and follow the instructions you find in those sections.
- Run the blocks under Importing Libraries to ensure you have access to all the functions we will use for this project. We are importing quite a few libraries for this project, so do not worry if this takes a bit of time to run.
2. Gathering Our Data
In this section, you will collect data from a source of your choice, giving you the flexibility to design your project as you see fit. You have a wide range of options, such as social media posts, news articles, reviews, and more.
For example, you could compare texts from one of the following categories:
- Social media posts from your friends
- News articles from different websites
- Movie or product reviews
Pick at least two different sources for whichever type of text you choose (e.g., social media posts from two friends, news articles from two different websites, or reviews for two different movies).
- Within the “sentiment_analysis” folder on your MyDrive, create subfolders for each category you are analyzing.
- For example, if you are comparing tweets from your friends, create a separate folder for each friend and name it accordingly (e.g., one folder named “Andy,” another named “Wendy,” and so on).
- Using a text editor, copy and paste the text you want to analyze, then save it as a plain text file (.txt). Finally, upload the .txt file to the corresponding subfolder in your “sentiment_analysis” folder.
- You might find that it is helpful to name the .txt files after the folder as well (e.g., Andy01.txt, Andy02.txt, and so on).
- Aim to have at least 10 different .txt files for each subfolder.
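Before moving on, it can help to confirm that every subfolder has enough files. The following is a minimal sketch, not part of the Science Buddies notebook; the function names and the `MIN_FILES` constant are hypothetical, and the Google Drive path in the comment is an assumption about where Colab mounts your Drive.

```python
from pathlib import Path

MIN_FILES = 10  # target number of .txt files per subfolder (from the step above)


def count_txt_files(root):
    """Return {subfolder name: number of .txt files} for each subfolder of root."""
    counts = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            counts[folder.name] = len(list(folder.glob("*.txt")))
    return counts


def report(root):
    """Print one line per subfolder, flagging those below MIN_FILES."""
    for label, n in count_txt_files(root).items():
        status = "ok" if n >= MIN_FILES else f"needs {MIN_FILES - n} more"
        print(f"{label}: {n} files ({status})")


# Example call (assumed Colab mount path; adjust to your own setup):
# report("/content/drive/MyDrive/sentiment_analysis")
```

Only files ending in .txt are counted, so a file saved in another format by mistake shows up as a missing file.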
3. Natural Language Toolkit (NLTK) Basics
Before diving into the project, we will first review the basics of the Natural Language Toolkit (NLTK) library, which includes the VADER sentiment analysis tool.
- (Code Block 3A) Run this code block to tokenize a sentence of your choice. Tokenization is the process of breaking down a string of text into smaller components, such as words or punctuation marks.
- To try it out, replace the current sentence inside the quotation marks with any sentence you like and observe how the NLTK library tokenizes it!
- (Code Block 3B) This code block uses the NLTK library to perform Part-of-Speech (POS) tagging on a list of tokens. To explore the available POS tags, refer to the text block above this code block.
- This code block processes the same sentence you used in Code Block 3A. If you would like to see the POS tagging for a different sentence, simply update the sentence in Code Block 3A and rerun both code blocks.
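To get a feel for what tokenization does, here is a simplified stand-in written with Python's standard `re` module. It is not NLTK's tokenizer and does not reproduce its rules; the `simple_tokenize` name and the pattern are illustrative assumptions. NLTK's own tokenizer applies more rules and typically splits contractions such as "I'm" into two tokens, so expect the notebook's output to differ in places.

```python
import re

# One token is either:
#   - a run of letters, optionally followed by an apostrophe and more letters
#   - a run of digits
#   - a single punctuation mark (anything that is not a word character or space)
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?|\d+|[^\w\s]")


def simple_tokenize(sentence):
    """Split a sentence into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


tokens = simple_tokenize("I'm so happy, and it's only 9 am!")
print(tokens)
```

POS tagging is not sketched here because it relies on a trained model rather than a simple rule; Code Block 3B uses NLTK for that step.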
4. Test VADER Sentiment Scoring
Before applying VADER to the entire dataset, let’s first explore how it analyzes a few individual sentences.
- (Code Block 4A) This code block creates a SentimentIntensityAnalyzer, which is part of the VADER sentiment analysis tool. Run this code block.
- (Code Block 4B) This code block uses the SentimentIntensityAnalyzer to analyze the sentiment of the given sentence. Experiment by inputting different positive sentences to see how the analysis responds. Here is what each key means:
- neg: The proportion of the text that conveys negative sentiment (e.g., neg: 0.0 means the text does not contain any negative sentiment).
- neu: The proportion of the text that is neutral (e.g., neu: 0.4 means about 40% of the text is considered neutral).
- pos: The proportion of the text that conveys positive sentiment (e.g., pos: 0.6 means about 60% of the text is considered positive).
- compound: An overall sentiment score normalized to a range from -1 (most negative) to +1 (most positive). For example, compound: 0.8395 means that the overall sentiment is strongly positive. VADER computes it by summing the valence scores of the individual words, adjusted for intensity cues such as punctuation, capitalization, and negation, and then normalizing the sum; it is not an average of the neg, neu, and pos values.
- (Code Block 4C) This code block also uses the SentimentIntensityAnalyzer to analyze the sentiment of the given sentence. Experiment by inputting different negative sentences to see how the analysis responds.
- (Code Block 4D) This code block works the same way as Code Blocks 4B and 4C. This time, try to challenge the SentimentIntensityAnalyzer by crafting sentences that could confuse it, such as a negative sentence that looks positive or vice versa (e.g., using sarcasm).
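The compound score can be easier to interpret once you see how a raw total is squeezed into the -1 to +1 range. The sketch below follows the normalization used in VADER's published source code (the total divided by the square root of the total squared plus a constant of 15) and the ±0.05 cutoffs that VADER's documentation suggests for labeling text. The valence totals in the loop are made-up inputs for illustration, not output from VADER, and the function names are hypothetical.

```python
import math


def normalize(valence_sum, alpha=15):
    """Map an unbounded valence total into the range -1..+1."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


def label_from_compound(compound, threshold=0.05):
    """Turn a compound score into a positive / negative / neutral label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up valence totals, only to show the shape of the curve.
for total in (0.0, 1.9, 5.0, -3.1):
    compound = normalize(total)
    print(f"sum={total:+.1f} -> compound={compound:+.4f} "
          f"({label_from_compound(compound)})")
```

Because of the square root, the score rises quickly for small totals and flattens out near ±1, which is why adding more strongly positive words to an already positive sentence changes the compound score only slightly.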
5. Using VADER on your Dataset
- (Code Block 5A) Run this code block to create a DataFrame, which is like a table that will be used to load and manipulate the data in the notebook. You will see the data from your MyDrive populate in a table below the code block.
- (Code Block 5B) This code block calculates sentiment scores for each .txt file you uploaded earlier and stores the results. Run this code block.
- (Code Block 5C) This code block takes the results from Code Block 5B and adds the results to the original DataFrame.
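The three steps above (load the files, score each one, attach the scores) can be sketched with plain Python data structures. This is an illustration under stated assumptions, not the notebook's code: the notebook uses a DataFrame, while this sketch uses a list of dictionaries, and `load_rows`, `add_scores`, and `score_fn` are hypothetical names. With NLTK set up as in the notebook, `score_fn` could be the analyzer's `polarity_scores` method.

```python
from pathlib import Path


def load_rows(root):
    """Read every .txt file in each subfolder of root into a list of rows.

    The subfolder name becomes the row's label.
    """
    rows = []
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        for path in sorted(folder.glob("*.txt")):
            rows.append({
                "label": folder.name,
                "file": path.name,
                "text": path.read_text(encoding="utf-8"),
            })
    return rows


def add_scores(rows, score_fn):
    """Return new rows with the scores from score_fn(text) merged in.

    score_fn is any function that takes a string and returns a dict
    such as {"neg": ..., "neu": ..., "pos": ..., "compound": ...}.
    """
    return [{**row, **score_fn(row["text"])} for row in rows]
```

Passing the scoring function in as an argument keeps the loading step separate from the scoring step, so the same rows could later be scored by a different model.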
6. Visualize the Data
- (Code Block 6A) This code block creates a bar plot to visualize the average compound sentiment score for each label in the DataFrame. Run this code block.
- Which label had the higher average compound score? Which had the lower? Why do you think that is?
- (Code Block 6B) This code block creates a series of bar plots to compare the sentiment scores (positive, neutral, negative) across different labels in the dataset. Run this code block.
- What trends or patterns do you notice in the positive, neutral, and negative sentiment scores for each label?
- Which label has the highest/lowest positive sentiment? How about neutral and negative sentiment?
- Are there any labels with similar sentiment distributions? What does this suggest?
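The bar plots are built from per-label averages. If you want to see the numbers behind the bars, a minimal sketch like the following computes them; the `mean_by_label` name is hypothetical and the rows are made-up values, not real VADER output.

```python
from collections import defaultdict
from statistics import mean


def mean_by_label(rows, key="compound"):
    """Average the given score for each label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["label"]].append(row[key])
    return {label: mean(values) for label, values in groups.items()}


# Made-up scores for two labels, for illustration only.
rows = [
    {"label": "Andy", "compound": 0.6, "pos": 0.4},
    {"label": "Andy", "compound": 0.2, "pos": 0.2},
    {"label": "Wendy", "compound": -0.4, "pos": 0.0},
    {"label": "Wendy", "compound": 0.0, "pos": 0.1},
]

for key in ("compound", "pos"):
    print(key, mean_by_label(rows, key))
```

Keep in mind that an average can hide a wide spread: a label with a few very positive and a few very negative texts can have the same mean as a label whose texts are all neutral.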
7. Review Individual Examples
- (Code Block 7A) You can review examples one at a time in this code block. To view a different example, change row_number to any number from 0 to one less than the total number of .txt files you have (e.g., with 30 .txt files, row_number can be anything from 0 to 29).
- Can you find any examples that you thought were positive but VADER gave them a more negative score, or vice versa?
- Why do you think VADER may have a hard time interpreting sarcasm?
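One way to hunt for surprising examples is to write down your own label for a few files and list the rows where the score disagrees. The sketch below assumes each row has a `file` name and a `compound` score; the function names, the ±0.05 cutoff (the value VADER's documentation suggests), and the sample data are illustrative assumptions rather than part of the notebook.

```python
def label_from_compound(compound, threshold=0.05):
    """Turn a compound score into a positive / negative / neutral label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def find_disagreements(rows, my_labels):
    """List (row_number, file, my label, score-based label) where they differ.

    rows: list of dicts with "file" and "compound" keys.
    my_labels: {file name: "positive" | "negative" | "neutral"}.
    Files without an entry in my_labels are skipped.
    """
    mismatches = []
    for row_number, row in enumerate(rows):
        predicted = label_from_compound(row["compound"])
        expected = my_labels.get(row["file"])
        if expected is not None and expected != predicted:
            mismatches.append((row_number, row["file"], expected, predicted))
    return mismatches


# Made-up scores and hand labels, for illustration only.
rows = [
    {"file": "Andy01.txt", "compound": 0.70},
    {"file": "Andy02.txt", "compound": -0.30},
]
my_labels = {"Andy01.txt": "negative", "Andy02.txt": "negative"}
print(find_disagreements(rows, my_labels))
```

The row numbers in the result start at 0, matching the row_number convention described for Code Block 7A, so you can look up each mismatch there.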
Variations
- Compare the results of the VADER sentiment analysis tool with those of the RoBERTa model, using the roberta_sentiment_analysis.ipynb notebook from Science Buddies. Additional details about the RoBERTa model are available in the Bibliography. Will RoBERTa outperform VADER?
- For a unique challenge, focus exclusively on analyzing sarcastic text. How well does the VADER model handle sarcasm?