The text of a tweet is limited to 140 characters, but there is a lot of metadata to be extracted from it. Here is an example tweet:


@james Agree! RT@abc fed gov should spend more on welfare in north states @smith @john #economy – sent by @smith, 1st Jan 2008


There are 5 types of tokens that can be identified:

  1. RT@ – Retweet: repeats a tweet from the user specified by @. The username is not part of the retweet
  2. @username – User identifier: every user mentioned in the tweet will see it in their Twitter feed
  3. # – Hashtag: used to tag tweets for topical searches
  4. words – all single words
  5. phrases – all 2-word and 3-word phrases

With the tokens identified, we need rules to perform lexical analysis on the tweet.
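As a starting point, the first three token types can be picked out with regular expressions. This is only a rough sketch: the pattern names and the assumption that usernames and hashtags consist of letters, digits and underscores are ours, not something Twitter guarantees.

```python
import re

# Hypothetical patterns: we assume usernames and hashtags are made up of
# word characters (letters, digits, underscore).
RT_PATTERN = re.compile(r"\bRT\s*@(\w+)")   # RT@abc or RT @abc
USER_PATTERN = re.compile(r"@(\w+)")        # any @username
HASHTAG_PATTERN = re.compile(r"#(\w+)")     # any #hashtag


def find_tokens(text):
    """Return the raw RT users, usernames and hashtags found in a tweet."""
    return {
        "rt_users": RT_PATTERN.findall(text),
        "usernames": USER_PATTERN.findall(text),
        "hashtags": HASHTAG_PATTERN.findall(text),
    }


tweet = ("@james Agree! RT@abc fed gov should spend more on welfare "
         "in north states @smith @john #economy")
tokens = find_tokens(tweet)
```

Note that the username list still includes the RT user at this stage. Deciding what role each username plays is the job of the rules.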


Establishing rules


1. Usernames


The placement of the @username is important. When placed at the start of the tweet, this indicates a directed tweet (similar to a To field in email). When placed at the end of the tweet, this indicates a copied tweet (similar to a CC field in email). These users can be grouped as Receivers (To) and Copied (CC) users.


For tweets with RT, the @username attached to the first RT is taken as the RT User. Usernames in subsequent RT@username tokens will not be taken as Receivers, Copied Users or RT Users. It is also important to note that the RT User is usually omitted from the To or CC portions of the tweet to save space. In the example tweet, @smith agrees with @abc even though @abc is not in front.


The placement of @james indicates that @smith was forwarding the tweet to @james, telling him he agreed with @abc’s opinion. This is similar to carbon copying, but the placement of @james in front can imply that @smith was directing the message to him. When users copy other users in this style, the meaning becomes ambiguous. For research purposes, @james will be treated as a Receiver, based on the established rules.


The set of all usernames (minus the Sender, unless explicitly in the tweet) will be labeled as Mentioned users.
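The username rules could be sketched roughly as below. The function name classify_users and the returned dictionary keys are hypothetical, and we assume that trailing hashtags should be skipped when looking for Copied users at the end of a tweet, which the rules above do not spell out.

```python
import re

USERNAME = re.compile(r"@\w+")  # assumed username syntax


def classify_users(text):
    """Group the usernames in a tweet by role, following the rules above."""
    tokens = text.split()

    # Receivers: the unbroken run of @usernames at the start of the tweet.
    receivers = []
    for tok in tokens:
        if USERNAME.fullmatch(tok):
            receivers.append(tok)
        else:
            break

    # RT User: the username attached to the first RT token only.
    rt_match = re.search(r"\bRT\s*(@\w+)", text)
    rt_user = rt_match.group(1) if rt_match else None

    # Copied users: trailing @usernames, only when there is no RT.
    # Assumption: hashtags at the end are skipped rather than ending the scan.
    copied = []
    if rt_match is None:
        for tok in reversed(tokens[len(receivers):]):
            if tok.startswith("#"):
                continue
            if USERNAME.fullmatch(tok):
                copied.insert(0, tok)
            else:
                break

    # Mentioned users: every username in the text. The Sender only shows up
    # here if explicitly named in the tweet, since we only look at the text.
    mentioned = sorted(set(USERNAME.findall(text)))

    return {"receivers": receivers, "rt_user": rt_user,
            "copied": copied, "mentioned": mentioned}


example = classify_users(
    "@james Agree! RT@abc fed gov should spend more on welfare "
    "in north states @smith @john #economy")
no_rt = classify_users("@bob lunch at noon? @carol @dave")
```

In the example tweet, @smith and @john sit at the end but are not treated as Copied users, because the tweet contains an RT.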


2. Text and Retweets


One problem with the RT token is separating the retweet from text added on by the user. In our example, the hashtag #economy may have come from the retweet or the user @smith. Any usernames appearing at the end of a retweet may also have been additions.


Due to the 140-character limit, the Sender may edit the RT and add comments at the end. As there is no guaranteed way to separate retweet from tweet, we will assume that all text after RT@username is part of the retweet.
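Under that assumption, the split is a simple cut at the first RT token. The sketch below uses a hypothetical helper split_retweet, and it also strips leading @usernames from the Sender text on the assumption that they are already accounted for as Receivers.

```python
import re


def split_retweet(text):
    """Split a tweet into (sender_text, retweet) at the first RT@username.

    Returns (text, None) when the tweet contains no RT.
    """
    match = re.search(r"\bRT\s*@\w+", text)
    if match is None:
        return text.strip(), None

    sender_text = text[:match.start()].strip()
    # Assumption: leading @usernames are Receivers, not Sender text.
    sender_text = re.sub(r"^(?:@\w+\s*)*", "", sender_text)

    # Everything from the RT token onwards is treated as the retweet.
    retweet = text[match.start():].strip()
    return sender_text, retweet


sender_text, retweet = split_retweet(
    "@james Agree! RT@abc fed gov should spend more on welfare "
    "in north states @smith @john #economy")
```

This deliberately does not try to guess which trailing usernames or hashtags were added by the Sender.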


3. Words and phrases


This is quite straightforward. The text is split by spaces, and punctuation is retained so that sentence boundaries stay visible. This avoids building phrases across text like ‘Agree!’, where the exclamation mark ends the sentence. So ‘Agree’ is stored as a word, but phrases start from ‘fed gov ...’ onwards.


For 2-word and 3-word phrases, we start at the first word of each sentence and build up all combinations of consecutive words until we hit a breakpoint (exclamation mark, comma, colon, semi-colon, full stop).


For any sentence with n words (n ≥ 3), there are x possible 2-word phrases and y possible 3-word phrases, where:


x = n - 1

y = n - 2
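The phrase-building step and the counts above can be checked with a small sketch. The helper names split_sentences and ngrams are our own, and we assume a breakpoint only counts when it appears at the end of a space-separated token.

```python
BREAKPOINTS = "!,:;."  # exclamation mark, comma, colon, semi-colon, full stop


def split_sentences(text):
    """Split text into lists of words, starting a new list at each breakpoint."""
    sentences, current = [], []
    for token in text.split():
        word = token.rstrip(BREAKPOINTS)
        if word:
            current.append(word)
        # The token ended with a breakpoint, so the sentence ends here.
        if word != token and current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def ngrams(words, n):
    """Return every run of n consecutive words as a single phrase."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentences = split_sentences(
    "Agree! fed gov should spend more on welfare in north states")
two_word = ngrams(sentences[1], 2)
three_word = ngrams(sentences[1], 3)
```

The second sentence has n = 10 words, so we get 9 two-word phrases and 8 three-word phrases, while the one-word sentence ‘Agree’ produces no phrases at all.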


The case of a word is retained. This is due to the ambiguous nature of some words that can also refer to places (e.g. Tenang vs. tenang). From past experience with the Hulu Selangor and Sibu by-elections, users tend to ignore case. However, that was when we knew what words to look for. Preserving case might help us pick up on names of people and places we hadn’t thought were popular.


To speed up our analytics we will have 2 sets of meta-data, one with case preserved and one converted to lower-case. When we pick up on high occurrences of a word in the lower-case set we can cross-reference with the other set.
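One way the two sets might be kept side by side is sketched below. The function count_words, the variants lookup and the sample tweets are illustrative only and not real data.

```python
from collections import Counter


def count_words(tweets):
    """Count words twice: once with case preserved, once in lower-case.

    Also records which cased forms were seen for each lower-case word,
    so a spike in the lower-case set can be cross-referenced.
    """
    cased = Counter()
    lowered = Counter()
    variants = {}
    for text in tweets:
        for word in text.split():
            key = word.lower()
            cased[word] += 1
            lowered[key] += 1
            variants.setdefault(key, set()).add(word)
    return cased, lowered, variants


# Made-up sample tweets, for illustration only.
sample = ["Tenang by-election today", "stay tenang everyone", "TENANG results"]
cased, lowered, variants = count_words(sample)

# Cross-reference: break the lower-case total down by original casing.
breakdown = {form: cased[form] for form in variants["tenang"]}
```

Here the lower-case set shows ‘tenang’ three times, and the breakdown tells us how many of those were written as a possible place name.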


Tokenizing Rules


  1. All @usernames at the start of the tweet => grouped as Receivers
  2. If there is an RT:
     a) Anything to the left of RT@username => Sender text
     b) Anything to the right of RT@username => Retweet
  3. All @usernames at the end of the tweet, if there is no RT => Copied Users
  4. All #hashtags => grouped as Hashtags
  5. All @usernames (excluding Sender, unless explicitly in text) => Mentioned Users
  6. Words and phrases obtained by splitting the tweet by spaces, then building up phrases with punctuation marks as breakpoints

Getting the meta-data

With rules in hand, let's split the tweet into its meta-data:

@james Agree! RT@abc fed gov should spend more on welfare in north states @smith @john #economy – sent by @smith, 1st Jan 2008

Receiver: @james

Sender text: Agree!

Retweet: RT@abc fed gov should spend more on welfare in north states @smith @john #economy

RT User: @abc

Hashtag: #economy

Mentioned Users: @abc @james @john @smith

Sender: @smith

Creation Date: 1st Jan 2008


Words: Agree, fed, gov, should, spend, more, on, welfare, in, north, states

2-word Phrases: fed gov, gov should, should spend, spend more, more on, on welfare, welfare in, in north, north states

3-word Phrases: fed gov should, gov should spend, should spend more, spend more on, more on welfare, on welfare in, welfare in north, in north states


Ahmed Kamal

Published On: February 12th, 2011 / Categories: Theories