Implementation of Text Mining (baby steps)

Steps:

  • Tokenizing

  • Stemming

  • Analyzing

  • Result / Knowledge

Install nltk: pip install nltk
Install sklearn: pip install scikit-learn

Open Visual Studio Code, then type this:

import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# download nltk corpus (first time only)
nltk.download("all")
# Load the amazon review dataset
df = pd.read_csv(
    "https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/amazon.csv"
)


def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text.lower())
    # Remove stop words
    filtered_tokens = [
        token for token in tokens if token not in stopwords.words("english")
    ]
    # Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    # Join the tokens back into a string
    processed_text = " ".join(lemmatized_tokens)
    return processed_text


analyzer = SentimentIntensityAnalyzer()
# create get_sentiment function: label a review 1 (positive) if
# VADER's positive score is above zero, otherwise 0 (negative)
def get_sentiment(text):
    scores = analyzer.polarity_scores(text)
    sentiment = 1 if scores["pos"] > 0 else 0
    return sentiment


# apply get_sentiment function
df["sentiment"] = df["reviewText"].apply(get_sentiment)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(df["Positive"], df["sentiment"]))
from sklearn.metrics import classification_report
print(classification_report(df["Positive"], df["sentiment"]))



Run the code, and here's the result:

              precision    recall  f1-score   support

           0       0.69      0.29      0.41      4767
           1       0.81      0.96      0.88     15233

    accuracy                           0.80     20000
   macro avg       0.75      0.62      0.64     20000
weighted avg       0.78      0.80      0.77     20000
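As a sanity check on how to read these numbers, here is a tiny hand-checkable example (the labels are made up, not the Amazon data):

```python
from sklearn.metrics import confusion_matrix, classification_report

# Made-up labels for illustration only
y_true = [0, 0, 1, 1, 1, 1]  # actual classes
y_pred = [0, 1, 1, 1, 1, 0]  # predicted classes

cm = confusion_matrix(y_true, y_pred)
print(cm)
# Row i, column j counts samples of true class i predicted as class j:
# cm[1][1] = 3 positives predicted correctly out of 4 actual positives,
# so recall for class 1 is 3/4 = 0.75.
print(classification_report(y_true, y_pred))
```

The same reading applies to the 20,000-review report above: class 1 (positive) has high recall (0.96) because VADER labels almost anything with a nonzero positive score as positive, while class 0 recall is low (0.29).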
