Part of speech tagging - Sequence labelling in Python (part 2)

This is the second post in my series Sequence labelling in Python; find the previous one here: Introduction. Get the code for this series on GitHub.

Our algorithm needs more than the tokens themselves to be reliable; we can add the part of speech of each token as a feature.

To perform the part-of-speech (POS) tagging, we'll be using the Stanford POS Tagger. The tagger (or at least an interface to it) is available through Python's NLTK library; however, we need to download the models from Stanford's download page. Since we are working with Spanish, we should download the full model under the "2017-06-09 new Spanish and French UD models" subtitle.

Once downloaded, unzip it and keep track of where the files end up. You could execute:

make models/stanford

This places the necessary files inside a folder called stanford-models. Be aware that you will need to have Java installed for the tagger to work!
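
Since the tagger depends on Java, it can help to check for it before going any further. The snippet below is only a minimal sketch: java_available is a hypothetical helper (it is not part of NLTK or of this series' repository), and it merely assumes that the java executable should be discoverable on the PATH.

```python
import shutil


def java_available(executable="java"):
    """Return True if the given executable can be found on the PATH."""
    # shutil.which returns the full path of the executable, or None if absent
    return shutil.which(executable) is not None


if not java_available():
    print("Java was not found on the PATH; the Stanford tagger will not run.")
```

This only confirms that an executable with that name exists; it says nothing about the Java version installed.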

Code

Let us start with some imports and loading our dataset:

import json
import pandas as pd

# Load dataset:
vuelos = pd.read_csv('data/vuelos.csv', index_col=0)
with pd.option_context('max_colwidth', 800):
    print(vuelos.loc[:40:5][['label']])

Some of the results:

0                                           ¡CUN a Ámsterdam $8,960! Sin escala en EE.UU
5              ¡GDL a Los Ángeles $3,055! Directos (Agrega 3 noches de hotel por $3,350)
10                      ¡CUN a Puerto Rico $3,296! (Agrega 3 noches de hotel por $2,778)
15    ¡LA a Seúl, regresa desde Tokio 🇯🇵 $8,607! (Por $3,147 agrega 11 noches de hostal)
20                           ¡CDMX a Chile $8,938! (Agrega 9 noches de hotel por $5,933)
25                                               ¡CUN a Holanda $8,885! Sin escala EE.UU
30                              ¡Todo México a París, regresa desde Amsterdam – $11,770!
35  ¡CDMX a Vietnam $10,244! Sin escala en EE.UU (Agrega 15 noches de hostal por $2,082)
40                     ¡CDMX a Europa en Semana Santa $14,984! (París + Ibiza + Venecia)

To interface with the Stanford tagger, we can use the StanfordPOSTagger class inside the nltk.tag.stanford module. We create an object by passing in both our language-specific model and the tagger .jar we previously downloaded from Stanford's website.

Then, as a quick test, we tag a Spanish sentence to see what we get back from the tagger.

from nltk.tag.stanford import StanfordPOSTagger

spanish_postagger = StanfordPOSTagger('stanford-models/spanish.tagger', 
                                      'stanford-models/stanford-postagger.jar')

phrase = 'Amo el canto del cenzontle, pájaro de cuatrocientas voces.'
tags = spanish_postagger.tag(phrase.split()) 
print(tags)

The results:

[('Amo', 'vmip000'), ('el', 'da0000'), ('canto', 'nc0s000'), 
('del', 'sp000'), ('cenzontle,', 'dn0000'), ('pájaro', 'nc0s000'), 
('de', 'sp000'), ('cuatrocientas', 'pt000000'), ('voces.', 'np00000')]

The first thing to note is that the tagger takes in a list of strings, not a full sentence; that is why we need to split our sentence before passing it in. A second thing to note is that we get back a list of tuples, where the first element of each tuple is the token and the second is the POS tag assigned to said token. The POS tags are explained here, and I have made a dictionary for easy lookups.
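
Because the result is a plain list of (token, tag) pairs, it is easy to reshape. As a small illustration (not part of the original code), this sketch separates the pairs into two parallel sequences using the first three pairs from the output of the quick test:

```python
# First three pairs copied from the tagger output of the quick test
tags = [('Amo', 'vmip000'), ('el', 'da0000'), ('canto', 'nc0s000')]

# zip(*pairs) transposes a list of pairs into two parallel tuples
tokens, pos_tags = zip(*tags)

print(tokens)    # ('Amo', 'el', 'canto')
print(pos_tags)  # ('vmip000', 'da0000', 'nc0s000')
```

Note that zip(*tags) raises a ValueError on unpacking if tags is empty, so an empty sentence would need its own guard.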

We can inspect the tokens a bit more:

with open("aux/spanish-tags.json", "r") as r:
    spanish_tags = json.load(r)

for token, tag in tags[:10]:
    print(f"{token:15} -> {spanish_tags[tag]['description']}")

And the results:

Amo             -> Verb (main, indicative, present)
el              -> Article (definite)
canto           -> Common noun (singular)
del             -> Preposition
cenzontle,      -> Numeral
pájaro          -> Common noun (singular)
de              -> Preposition
cuatrocientas   -> Interrogative pronoun
voces.          -> Proper noun
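
The direct lookup spanish_tags[tag] raises a KeyError if the tagger ever returns a tag that is missing from the dictionary. A more forgiving lookup could look like the sketch below. The describe helper is hypothetical, and the small table is only a stand-in for aux/spanish-tags.json, assuming each entry is an object with a 'description' key as the lookup code implies.

```python
# Stand-in for aux/spanish-tags.json (assumed structure: tag -> {"description": ...}),
# with descriptions copied from the output shown in this post
spanish_tags = {
    "vmip000": {"description": "Verb (main, indicative, present)"},
    "da0000": {"description": "Article (definite)"},
}


def describe(tag, tag_table, default="Unknown tag"):
    """Return the description for a tag, or a default when the tag is missing."""
    entry = tag_table.get(tag)
    return entry["description"] if entry else default


print(describe("da0000", spanish_tags))
print(describe("not-a-tag", spanish_tags))
```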

Specific tokenisation

As you may imagine, using split to tokenise our text is not the best idea: in the output above, punctuation stays glued to tokens such as "cenzontle," and "voces.", which likely explains why both received the wrong tag. It is almost certainly better to create our own function, taking into consideration the kind of text that we are going to process. The function below uses NLTK's TweetTokenizer and handles flag emojis. As a final touch, it also returns the position of each of the returned tokens:

from nltk.tokenize import TweetTokenizer

TWEET_TOKENIZER = TweetTokenizer()

# This function exists in vuelax.tokenisation in this same repository
def index_emoji_tokenize(string, return_flags=False):
    flag = ''
    ix = 0
    tokens, positions = [], []
    for t in TWEET_TOKENIZER.tokenize(string):
        ix = string.find(t, ix)
        if len(t) == 1 and ord(t) >= 127462:  # this is the code for 🇦
            if not return_flags: continue
            if flag:
                tokens.append(flag + t)
                positions.append(ix - 1)
                flag = ''
            else:
                flag = t
        else:
            tokens.append(t)
            positions.append(ix)
        ix += 1  # resume the search just after the start of this token
    return tokens, positions


label = vuelos.iloc[75]['label']
print(label)
print()
tokens, positions = index_emoji_tokenize(label, return_flags=True)
print(tokens)
print(positions)

And these are the results:

¡LA a Bangkok 🇹🇭$8,442! (Por $2,170 agrega 6 noches de Hotel)

['¡', 'LA', 'a', 'Bangkok', '🇹🇭', '$', '8,442', '!', '(', 'Por', '$', '2,170', 'agrega', '6', 'noches', 'de', 'Hotel', ')']
[0, 1, 4, 6, 14, 16, 17, 22, 24, 25, 29, 30, 36, 43, 45, 52, 55, 60]
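
A cheap sanity check for any tokeniser that reports positions is to slice the original string at each position and compare the slice with the token. The sketch below does exactly that; misaligned_tokens is a hypothetical helper that is not part of the repository, and the example uses the first offer in the dataset with its tokens and character offsets written out by hand.

```python
def misaligned_tokens(string, tokens, positions):
    """Return the (token, position) pairs whose position does not point at the token."""
    return [
        (tok, pos)
        for tok, pos in zip(tokens, positions)
        if string[pos:pos + len(tok)] != tok
    ]


text = "¡CUN a Ámsterdam $8,960!"
tokens = ["¡", "CUN", "a", "Ámsterdam", "$", "8,960", "!"]
positions = [0, 1, 5, 7, 17, 18, 23]

# An empty list means every position points at its token
print(misaligned_tokens(text, tokens, positions))
```

Running a check like this over the whole dataset would surface repeated tokens (such as a second "$") that were matched against an earlier occurrence.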

Obtaining our ground truth for our problem

We do not need POS tagging to generate a tagged dataset!

Now, since this is a supervised algorithm, we need to get some labels from "expert" users. These labels will be used to train the algorithm to produce predictions. The task for the users will be simple: assign one of the following letters to each token: { o, d, s, p, f, n }. While there are online tools to perform this task, I decided to go more old school with a simple CSV file with a format more or less like this:

Offer Id  Token      Position  POS      Label
0         ¡          0         faa      [USER LABEL]
0         CUN        1         np00000  [USER LABEL]
0         a          5         sp000    [USER LABEL]
0         Ámsterdam  7         np00000  [USER LABEL]
0         $          17        zm       [USER LABEL]
0         8,960      18        dn0000   [USER LABEL]
0         !          23        fat      [USER LABEL]
0         Sin        25        sp000    [USER LABEL]
0         escala     29        nc0s000  [USER LABEL]
0         en         36        sp000    [USER LABEL]
0         EE.UU      39        np00000  [USER LABEL]

The values in the column marked [USER LABEL] should be filled in by the expert users who will help us label our data.

from tqdm.notebook import tqdm
import csv

path_for_data_to_label = "data/to_label.csv"

# newline="" is what the csv module recommends for files it writes;
# utf-8 keeps accented characters and emojis intact
with open(path_for_data_to_label, "w", newline="", encoding="utf-8") as w:
    writer = csv.writer(w)
    writer.writerow(['offer_id', 'token', 'position', 'pos_tag', 'label'])

    for offer_id, row in tqdm(vuelos.iterrows(), total=len(vuelos)):
        tokens, positions = index_emoji_tokenize(row["label"], return_flags=True)
        tags = spanish_postagger.tag(tokens)
        # One row per token; the label column is left empty for the annotators
        for token, position, (_, pos_tag) in zip(tokens, positions, tags):
            writer.writerow([
                offer_id,
                token,
                position,
                pos_tag,
                None
            ])

The file that needs to be labelled is located at data/to_label.csv.

Can we make this easy? I have gone through the "pains" of labelling some data myself; the labels are stored in the file data/to_label-done.csv.
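
Before training on a hand-labelled file, it is worth checking that every label belongs to the expected set and regrouping the rows into one sequence per offer. The following is a rough sketch, not the code used in this series: load_labelled is a hypothetical helper, it assumes the labelled file keeps the same five columns written above, and the sample rows carry arbitrary placeholder labels rather than real annotations.

```python
import csv
import io
from collections import defaultdict

# The label set given to the annotators
ALLOWED_LABELS = {"o", "d", "s", "p", "f", "n"}


def load_labelled(handle):
    """Group labelled rows by offer and collect rows with unexpected labels."""
    sequences = defaultdict(list)
    invalid = []
    for row in csv.DictReader(handle):
        if row["label"] not in ALLOWED_LABELS:
            invalid.append((row["offer_id"], row["token"], row["label"]))
        sequences[row["offer_id"]].append(
            (row["token"], row["pos_tag"], row["label"])
        )
    return dict(sequences), invalid


# Made-up sample: the labels are placeholders, and "x" is deliberately invalid
sample = io.StringIO(
    "offer_id,token,position,pos_tag,label\n"
    "0,¡,0,faa,n\n"
    "0,CUN,1,np00000,o\n"
    "0,a,5,sp000,x\n"
)

sequences, invalid = load_labelled(sample)
print(invalid)
```

With the real file, the same function could be given a handle opened with open("data/to_label-done.csv", newline="", encoding="utf-8"), provided the column names match.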

Visit the next post in the series: Other feature extraction. In the meantime, I hope this post has shed some light on how to use the StanfordPOSTagger; if you have questions, feel free to leave a comment here or contact me on Twitter via @io_exception.
