English Tweet Tokenizer
TwitIE is a named entity recognition pipeline specially tuned for use with Twitter data. This pipeline is a cut-down version of TwitIE that performs only tokenization of tweet data.
- Tweet: an annotation spanning a single tweet
- Token: the annotation that covers each individual word
  - kind: the type of token (word, punctuation, number, etc.)
  - length: the length of the token, in characters
  - string: the text of the token
- UserID: the username part of an @user mention, not including the leading @ sign
  - user: the username, not including the leading @ sign
- Emoticon: an emoticon such as :-)
  - normalized: the normalized form of the emoticon, e.g. the value for both :) and :-) is :)
- URL: any URL occurring within the tweet
  - string: the URL address
- Hashtag: a hashtag, including the leading # character
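To make the annotation types listed above more concrete, here is a minimal Python sketch of how they might be held in memory and queried. It does not reproduce the service's actual output format, which is not described here: the `Annotation` class, the `annotate` and `covered_text_by_type` helpers, the use of character offsets, and the sample tweet are all illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical in-memory form of one annotation (not the service's format)."""
    type: str                      # e.g. "Token", "UserID", "Hashtag"
    start: int                     # assumed: character offset where the span begins
    end: int                       # assumed: character offset just past the span
    features: dict = field(default_factory=dict)


def annotate(text, type_, covered, **features):
    """Build an annotation over the first occurrence of `covered` in `text`."""
    start = text.index(covered)
    return Annotation(type_, start, start + len(covered), features)


def covered_text_by_type(text, annotations):
    """Group the text covered by each annotation under its annotation type."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann.type].append(text[ann.start:ann.end])
    return dict(grouped)


tweet = "@alice loving #nlp :-) http://example.com"

# Hand-written annotations that follow the type and feature names listed above.
annotations = [
    annotate(tweet, "Tweet", tweet),
    annotate(tweet, "Token", "loving", kind="word", length=6, string="loving"),
    # UserID covers the username only, without the leading @ sign.
    annotate(tweet, "UserID", "alice", user="alice"),
    # Hashtag includes the leading # character.
    annotate(tweet, "Hashtag", "#nlp"),
    # Both :) and :-) share the normalized value ":)".
    annotate(tweet, "Emoticon", ":-)", normalized=":)"),
    annotate(tweet, "URL", "http://example.com", string="http://example.com"),
]

grouped = covered_text_by_type(tweet, annotations)
```

Note the asymmetry the sketch preserves from the descriptions: the UserID span excludes the @ sign, while the Hashtag span includes the # character.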
Use this pipeline
You can process up to 1,200 documents per day free of charge using the REST API, at an average rate of 2 documents/sec. Higher quotas are available for research users by arrangement; contact us for details.
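A client can stay within the free limits by pacing its own requests. The sketch below is one conservative way to do that, assuming a fixed gap between calls is acceptable; how the service itself measures the "average rate" or resets the daily count is not stated here, so the `QuotaThrottle` class and its parameters are hypothetical and the service's real enforcement may differ.

```python
import time


class QuotaThrottle:
    """Client-side pacing for a per-second rate and a per-day document cap.

    Assumption: spacing calls evenly (0.5 s apart for 2 documents/sec) is a
    safe reading of "average rate". The daily counter is never reset here;
    create a new throttle for each day's run.
    """

    def __init__(self, per_second=2.0, per_day=1200,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / per_second
        self.per_day = per_day
        self.clock = clock          # injectable so the pacing can be tested
        self.sleep = sleep
        self.used = 0
        self.next_slot = None       # earliest time the next call may start

    def acquire(self):
        """Block until the next call is allowed, or raise if the cap is hit."""
        if self.used >= self.per_day:
            raise RuntimeError("daily free quota exhausted")
        now = self.clock()
        if self.next_slot is not None and now < self.next_slot:
            self.sleep(self.next_slot - now)
            now = self.next_slot
        self.next_slot = now + self.min_interval
        self.used += 1
```

Call `acquire()` immediately before each document is submitted. At 2 documents/sec, the full free allowance of 1,200 documents takes about 600 seconds, or 10 minutes, to submit.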
The API endpoint for this pipeline is:
You can process any amount of data with this pipeline on a pay-as-you-go basis, for £0.80 per hour. This can be data you upload yourself, data you collected from Twitter, or the results of a previous job.