Texthero · Text preprocessing, representation and visualization from zero to hero.

import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)
df.head(2)

	text	topic
0	Claxton hunting first major medal\n\nBritish h...	athletics
1	O'Sullivan could run in Worlds\n\nSonia O'Sull...	athletics

df['text'] = hero.clean(df['text'])

	text	topic
0	claxton hunting first major medal british hurd...	athletics
1	sullivan could run worlds sonia sullivan indic...	athletics

See the preprocessing API for more customization options.
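As a rough illustration of what a customized cleaning pipeline amounts to, the sketch below applies a list of plain-Python steps in order. The helper names (`lowercase`, `strip_punctuation`, `collapse_whitespace`, `run_pipeline`) are hypothetical and written only for this example. They are not Texthero functions, and the library's own steps may behave differently.

```python
import re
import string
from functools import reduce

def lowercase(text):
    # Hypothetical step: normalize case.
    return text.lower()

def strip_punctuation(text):
    # Hypothetical step: replace each ASCII punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)

def collapse_whitespace(text):
    # Hypothetical step: squeeze runs of whitespace (including newlines) into one space.
    return re.sub(r"\s+", " ", text).strip()

def run_pipeline(text, steps):
    # Apply each step to the output of the previous one, left to right.
    return reduce(lambda acc, step: step(acc), steps, text)

custom_pipeline = [lowercase, strip_punctuation, collapse_whitespace]
cleaned = run_pipeline("O'Sullivan could run in Worlds\n\nSonia", custom_pipeline)
print(cleaned)
```

Because the pipeline is just an ordered list, customizing it means adding, removing, or reordering steps.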

df['tfidf'] = (
    hero.tfidf(df['text'], max_features=100)
)
df[["tfidf", "topic"]].head(2)

	tfidf	topic
0	[0.0, 0.13194458247285848, 0.0, 0.0, 0.0, 0.0,...	athletics
1	[0.0, 0.13056235989725676, 0.0, 0.205187581391...	athletics

There are many other ways to represent the data.
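One of the simplest alternatives to TF-IDF is a raw term-count vector per document. The following is a minimal standard-library sketch of that idea, not Texthero's implementation; the two sample documents are shortened from the cleaned output above, and `count_vector` is a hypothetical helper.

```python
from collections import Counter

docs = [
    "claxton hunting first major medal",
    "sullivan could run worlds sonia sullivan",
]

# Build a sorted vocabulary so every document maps to the same column order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocabulary):
    # Count how often each vocabulary word occurs in this document.
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [count_vector(doc, vocabulary) for doc in docs]

for word, column in zip(vocabulary, zip(*vectors)):
    print(word, column)
```

Each document becomes a fixed-length list of integers, which can then be fed to a dimensionality-reduction step in the same way as the TF-IDF vectors.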

df['pca'] = hero.pca(df['tfidf'])
hero.scatterplot(
    df,
    col='pca',
    color='topic',
    title="PCA BBC Sport news"
    title="PCA BBC Sport news"
)

df['named_entities'] = (
    hero.named_entities(df['text'])
)
df[['named_entities', 'topic']].head(2)

	named_entities	topic
0	[(claxton, ORG, 0, 7), (first, ORDINAL, 16, 21...	athletics
1	[(sullivan, ORG, 0, 8), (sonia sullivan, PERSO...	athletics

NUM_TOP_WORDS = 5
hero.top_words(df['text'])[:NUM_TOP_WORDS]
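Conceptually, a top-words listing is a corpus-wide word count sorted by frequency. The sketch below shows that idea with the standard library only. It is an assumption about the general technique, not a description of how `hero.top_words` is implemented, and the sample texts are shortened from the cleaned output above.

```python
from collections import Counter

NUM_TOP_WORDS = 5

texts = [
    "claxton hunting first major medal",
    "sullivan could run worlds sonia sullivan",
]

# Count every whitespace-separated token across all documents.
word_counts = Counter(word for text in texts for word in text.split())

# most_common returns (word, count) pairs, highest count first.
top = word_counts.most_common(NUM_TOP_WORDS)
print(top)
```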