Social Media Tokenizer: German

This application tokenizes German social media posts, correctly handling punctuation symbols within hashtags, emoticons, user mentions, etc. Although originally designed for processing Tweets, it can also handle text from other social media platforms, such as Facebook and Instagram, that use similar conventions.

This pipeline is a cut-down version of our German Named Entity Recognizer for Tweets.

Tweet

An annotation spanning a single Tweet

Token

An annotation covering each individual token (word, punctuation mark, number, etc.)

kind
the type of token: word, punctuation, number, etc.

length
the length, in characters, of the token

string
the text of the token

UserID

The username part of @user mentions, not including the leading @ sign

user
the username, not including the leading @ sign

Emoticon

Emoticons such as :-)

normalized
the normalized form of the emoticon; for example, the value for both :) and :-) is :)

URL

Used to annotate any URL occurring within the tweet

string
the URL address

Hashtag

Hashtags, including the leading # character
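As a rough illustration of how these annotation types and features might be consumed, the sketch below assumes the pipeline output has been loaded into a Python dict with the document text under "text" and the annotations grouped by type under "entities", each carrying an "indices" pair of character offsets. That layout, the sample tweet and the helper name covered_text are assumptions made for illustration; this page does not specify the output format.

```python
# Hypothetical output for one tweet. The dict layout ("text", "entities",
# "indices") is an assumption, not a format documented on this page.
sample = {
    "text": "@anna Guten Morgen :-) #Berlin",
    "entities": {
        # UserID excludes the leading @ sign
        "UserID": [{"indices": [1, 5], "user": "anna"}],
        # Emoticon carries the normalized form as a feature
        "Emoticon": [{"indices": [19, 22], "normalized": ":)"}],
        # Hashtag includes the leading # character
        "Hashtag": [{"indices": [23, 30]}],
    },
}


def covered_text(doc, annotation_type):
    """Return the text covered by each annotation of the given type."""
    spans = doc.get("entities", {}).get(annotation_type, [])
    return [doc["text"][a["indices"][0]:a["indices"][1]] for a in spans]


hashtags = covered_text(sample, "Hashtag")
emoticons = covered_text(sample, "Emoticon")
users = [a["user"] for a in sample["entities"]["UserID"]]
```

Under these assumptions the emoticon's covered text keeps the original spelling from the tweet, while its normalized feature holds the canonical form.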

1,200 free requests / day
Larger batches £0.80 / CPU hour

Use this pipeline

Single documents
You can process up to 1,200 documents per day free of charge using the REST API, at an average rate of 2 documents/sec. Higher quotas are available for research users by arrangement; contact us for details.
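A client that sends many documents may want to pace itself to stay within the stated rate. The following is a minimal sketch of one way to do that, spacing calls at least 0.5 seconds apart. The Throttle class is hypothetical and not part of any GATE library, and because the limit is stated as an average, strict spacing like this is more conservative than strictly necessary.

```python
import time


class Throttle:
    """Space successive calls at least 1/per_second seconds apart."""

    def __init__(self, per_second=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / per_second
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._next_allowed = None

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Calling wait() immediately before each request keeps a simple loop at or below two documents per second; it does not track the daily quota.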

The API endpoint for this pipeline is:

https://cloud-api.gate.ac.uk/process/german-tweet-tokenizer
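As a hedged sketch of what a call to this endpoint might look like, the code below builds (but does not send) an HTTP POST request using only the Python standard library. The endpoint URL is the one given above; everything else is an assumption that should be checked against the API documentation: that the document is sent as plain text in the request body, that JSON is requested via the Accept header, and that the API key is supplied as HTTP Basic credentials. The function name build_request and its parameters are hypothetical.

```python
import base64
import urllib.request

ENDPOINT = "https://cloud-api.gate.ac.uk/process/german-tweet-tokenizer"


def build_request(text, key_id, key_password):
    """Build (without sending) a POST request carrying one document."""
    credentials = base64.b64encode(
        f"{key_id}:{key_password}".encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        ENDPOINT,
        data=text.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "text/plain; charset=UTF-8",  # assumed input type
            "Accept": "application/json",                 # assumed output type
            "Authorization": f"Basic {credentials}",      # assumed auth scheme
        },
    )
```

With a valid key, the request could then be sent with urllib.request.urlopen(build_request(...)) and the response body decoded with the json module.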

Create API Key

Batches of documents
You can process any amount of data with this pipeline on a pay-as-you-go basis, for £0.80 per CPU hour. This can be data you upload yourself, data you collected from Twitter, or the results of a previous job.

Reserve a job