Social Media Tokenizer: English
This application tokenizes English social media posts, correctly handling the use of punctuation symbols within hashtags, emoticons, user mentions, etc. Whilst originally designed for processing Tweets, it can also handle text from other social media platforms, such as Facebook and Instagram, that use similar conventions.
The pipeline is a cut-down version of our English Named Entity Recognizer for Tweets (TwitIE).
- An annotation spanning a single Tweet
- The annotation that covers each individual word
- the type of token: word, punctuation, number, etc.
- the length, in characters, of the token
- the text of the token
- The username part of @user mentions, not including the leading @ sign
- the username, not including the leading @ sign
- Emoticons such as :-)
- the normalized form of the emoticon, e.g. the value for both :) and :-) is :)
- Used to annotate any URL occurring within the tweet
- the URL address
- Hashtags, including the leading # character
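To make the annotation types listed above more concrete, here is a minimal Python sketch of a regex-based tokenizer that keeps hashtags, user mentions, emoticons and URLs intact. It is not the pipeline's actual implementation, which is not described here: the pattern, the `tokenize` function, and the dictionary keys (`kind`, `string`, `length`, `user`, `normalized`) are all illustrative assumptions, and the emoticon rule covers only a handful of simple Western-style faces.

```python
import re

# Hypothetical pattern, for illustration only. Order matters: earlier
# alternatives win, so URLs, mentions, hashtags and emoticons are tried
# before the generic number/word/punctuation rules.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<url>https?://\S+)             # URL, up to the next whitespace
  | (?P<mention>@\w+)                 # @user mention
  | (?P<hashtag>\#\w+)                # hashtag, including the leading #
  | (?P<emoticon>[:;=]-?[)(DPp])      # simple emoticons such as :-) or ;)
  | (?P<number>\d+(?:[.,]\d+)*)       # integers and simple decimals
  | (?P<word>\w+(?:'\w+)?)            # words, with an optional apostrophe part
  | (?P<punctuation>[^\w\s])          # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into token dicts; key names are assumptions, not GATE's."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        kind = match.lastgroup          # name of the alternative that matched
        string = match.group()
        token = {"kind": kind, "string": string, "length": len(string)}
        if kind == "mention":
            # Username without the leading @ sign.
            token["user"] = string[1:]
        elif kind == "emoticon":
            # Drop the optional "nose" so :-) and :) share one normalized form.
            token["normalized"] = string.replace("-", "")
        tokens.append(token)
    return tokens


if __name__ == "__main__":
    for tok in tokenize("Loving #NLP with @gate_team :-) https://example.com/a"):
        print(tok)
```

A single alternation with named groups keeps the sketch short; a production tokenizer would need far broader emoticon, URL and Unicode coverage than this.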
Use this pipeline
You can process up to 1,200 documents per day free of charge using the REST API, at an average rate of 2 documents/sec. Higher quotas are available for research users by arrangement; contact us for details.
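A client that wants to stay within the free quota could pace its own requests. The following Python sketch shows one way to do that, using the two figures quoted above (1,200 documents per day, 2 documents per second on average). The `throttled` helper and its `send` callback are hypothetical: `send` stands in for whatever function actually submits one document, and nothing here reflects how the service itself enforces its limits.

```python
import time

DAILY_QUOTA = 1200   # free documents per day, as quoted above
MAX_RATE = 2.0       # average documents per second, as quoted above


def throttled(documents, send, rate=MAX_RATE, quota=DAILY_QUOTA,
              sleep=time.sleep, clock=time.monotonic):
    """Call send(doc) for each document, pacing calls and stopping at quota.

    `send` is a caller-supplied (hypothetical) function that submits one
    document; `sleep` and `clock` are injectable so the pacing can be tested.
    """
    interval = 1.0 / rate            # minimum spacing between two calls
    results = []
    next_slot = clock()              # earliest time the next call may start
    for count, doc in enumerate(documents):
        if count >= quota:
            break                    # quota used up; leave the rest unsent
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)
        results.append(send(doc))
        next_slot = max(next_slot, clock()) + interval
    return results


if __name__ == "__main__":
    # Stand-in sender that only measures text length; replace with a real call.
    print(throttled(["first post", "second post"], send=len))
```

The sketch counts documents only within a single run, so a long-lived client would also need to persist the day's count between runs.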
The API endpoint for this pipeline is:
You can process any amount of data with this pipeline on a pay-as-you-go basis, for GBP0.80 per hour. This can be data you upload yourself, data you collected from Twitter, or the results of a previous job.