WebTag system design
Introduction
The WebTag module should enable users to add free-form tags to records (and maybe other objects).
Official task
.
Use cases
Enriching the article metadata
According to a survey, the majority of users were willing to add tags to articles "to give a service to the community". These users should be the primary target of the tagging interface, because tags are a powerful classification scheme and enable thematic searches and navigation which can be more accurate than free-text search.
User goal: Altruistic enriching of metadata.
Our goal: Improve keyword list and search/navigation.
Prerequisites: Simple interface to edit tags.
Tagging for self-organization
Tags can easily be used for self-organization, meaning applying text strings according to some scheme which is relevant to the person without necessarily being included in our dictionary. These can be completely personal ("to read", "mine"), microformats or other specially formatted tags ("geo:12.34;56.78", "rel:colleague"), new HEP terms ("Unparticle physics", "Quantum string theory"), or anything else according to the user's fancy. The key is that it should be possible for a user to show all records tagged with a given tag (like Delicious), along with other related navigation.
Both forms of tagging can result in added value for the entire set of records, since we can look at the tags in use to detect useful terms and add those to the dictionary.
User goal: Easier navigation of personally relevant records.
Our goal: Learn from usage / tagging patterns.
Prerequisites: After saving tags, the web page should output each of them as a link to a page showing all records that the current user has added the tag to. The user should have access to an overview of her tags.
Architecture
- Primary interface shown in record Information (main) tab
cds-invenio/modules/webstyle/lib/webstyle_templates.py
.
- XHTML output.
- Check out the hidden feature in the regression test suite that tests all web pages with the W3C validator.
- Perhaps use cds-invenio/modules/websearch/lib/search_engine.py to save tags; that way we could set @action='.' + a hidden field action=save_tags, and just return to the page with the results of the save.
API
Code
Views
View tags on record
- Tags link to list of all records tagged with that value for the current user
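A minimal sketch of how saved tags could be rendered as per-user links in XHTML-safe form. The function name, the base URL and the query parameter names are hypothetical placeholders, not an existing Invenio interface; the point is only that both the URL and the link text need escaping.

```python
from html import escape
from urllib.parse import urlencode


def render_tag_links(tags, user_id, base_url="/yourtags/records"):
    """Return XHTML links, one per tag, separated by commas.

    `base_url` and the `tag` / `uid` parameter names are assumptions
    made for this sketch only.
    """
    links = []
    for tag in tags:
        # Percent-encode the tag for use in the query string.
        query = urlencode({"tag": tag, "uid": user_id})
        # Escape '&' and quotes so the attribute value is valid XHTML.
        href = escape(base_url + "?" + query, quote=True)
        links.append('<a href="%s">%s</a>' % (href, escape(tag)))
    return ", ".join(links)
```

Escaping the href separately from the link text keeps tags such as "a<b" or "R&D" from breaking validation.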
Edit tags on record
- The input box should either expand dynamically or use overflow: scroll; so that the entire list of tags is visible.
View all tags
Like
tags on record, but without filtering on record ID.
View all records tagged with X
Use search interface.
Relation to other modules
- Plugs into bibclassify web interface.
- Should make sure that the word "tag" is not used for other things in Invenio and INSPIRE, to avoid confusion.
Whitespace
It is useful to be able to have whitespace in tags, but it could easily lead to confusion: there are lots of whitespace characters, and distinguishing between them is not at all trivial on a computer screen. The current code tries to simplify the issue by using the following rules (based on the Python string.whitespace value), applied roughly in this sequence:
- Any newline character (\r, \n), or any combination thereof, separates two tags. This allows us to support the three major platforms: Windows (\r\n), classic Macintosh (\r) and Linux (\n). E.g., "foo\n\r\nbar" (without the quotes) is treated as two tags, "foo" and "bar".
- Any other whitespace is translated into Unicode code point U+0020, the standard space character in ASCII and on normal keyboards. E.g., "foo<Tab character>bar" → "foo bar".
- Any sequence of more than one space inside a tag is changed to a single space.
- Any space at the start or end of a tag is removed. E.g., " foo " → "foo".
- Any string which only contains spaces at this point is discarded.
This is of course based on the assumption that different spaces do not add semantic value to the tags in a field like HEP, and are purely useful as word separators. If this assumption is wrong (e.g., if Tab characters are semantically different from spaces in some taxonomy entries), the code will have to be revisited.
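The rules above can be sketched in Python as follows. This is an illustration of the described sequence, not the module's actual code; the function name split_tags is hypothetical.

```python
import re
import string

# All characters in string.whitespace except the newline characters.
_OTHER_WHITESPACE = set(string.whitespace) - {"\r", "\n"}
_NEWLINES = re.compile(r"[\r\n]+")
_MULTIPLE_SPACES = re.compile(r" {2,}")


def split_tags(raw):
    """Split raw user input into a list of normalized tags."""
    tags = []
    # Rule 1: any run of \r and \n characters separates two tags.
    for chunk in _NEWLINES.split(raw):
        # Rule 2: any other whitespace becomes U+0020.
        tag = "".join(" " if c in _OTHER_WHITESPACE else c for c in chunk)
        # Rule 3: collapse sequences of spaces into a single space.
        tag = _MULTIPLE_SPACES.sub(" ", tag)
        # Rule 4: remove spaces at the start and end of the tag.
        tag = tag.strip(" ")
        # Rule 5: discard anything that contained only spaces.
        if tag:
            tags.append(tag)
    return tags
```

Note that string.whitespace only covers ASCII whitespace; handling other Unicode spaces (for example U+00A0) would need an extended character set.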
Ideas
Are tags the same as keywords?
"In corpus linguistics a keyword is a word which occurs in a text more often than we would expect to occur by chance alone. Keywords are calculated by carrying out a statistical test (e.g., loglinear) which compares the word frequencies in a text against their expected frequencies derived in a much larger corpus, which acts as a reference for general language use." <http://en.wikipedia.org/wiki/Keyword_(linguistics)>
"An index term, subject term, subject heading, or descriptor, in information retrieval, is a term that captures the essence of the topic of a document. Index terms make up a controlled vocabulary for use in bibliographic records." <http://en.wikipedia.org/wiki/Index_term>
"In online computer systems terminology, a tag is a non-hierarchical keyword or term assigned to a piece of information (such as an internet bookmark, digital image, or computer file). This kind of metadata helps describe an item and allows it to be found again by browsing or searching. Tags are chosen informally and personally by the item's creator or by its viewer, depending on the system." <http://en.wikipedia.org/wiki/Tag_(metadata)>
"key-word, ... in information-retrieval systems, any informative word in the title or text of a document, etc., chosen as indicating the main content of the document" <http://dictionary.oed.com/cgi/entry/50126289/50126289se86?single=1&query_type=word&queryword=keyword&first=1&max_to_show=10&hilite=50126289se86>
tag, n. 8. e. "A character or set of characters appended to an item of data in order to identify it." <http://dictionary.oed.com/cgi/entry/50246105?query_type=word&queryword=tag&first=1&max_to_show=10&sort_type=alpha&result_place=1&search_id=Sbw1-rrtOnU-2364&hilite=50246105>
So tags seem to be for identification, while keywords are important words in the text. The two should therefore be kept separate.
Tags should be without structure. Remember that if taken to the extreme, all metadata could be put into tags, making correct usage extremely complex.
Auto-suggest input
Get keywords using
select distinct(value) from bib65x where tag = "6531_a";
Any auto-suggestion feature should be context aware in order to give the most useful answer. It should suggest tags which are relevant for:
- The user editing the tags
- The record at hand
- The tags already entered for the current record
- The text that the user has already entered as part of the current tag
This leads to an obvious set of input to the server whenever the client requests tag suggestions:
- The user ID
- Suggestions can include any public keywords and any tags used by the current user
- The record ID
- Suggestions could give preference to tags which appear in the record text or metadata
- The value of the tag input field (or generally, fields, but this is not applicable for the current solution)
- Only tags which are not already present in full in the input field should be suggested
- The position of the cursor in the tag input field (if applicable)
- Suggestions should be available before starting to type, to familiarize the user with the interface and typical tags, and to speed up the tagging
- Could be generated offline, but would scale with records × users
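A rough sketch of how a server-side suggestion function could combine the inputs listed above. Everything here is an assumption made for illustration: the function name, its parameters, and the ranking order (terms appearing in the record text first, then the user's own tags before public keywords). The cursor position is ignored, and the prefix is taken to be the text already typed for the current tag.

```python
def suggest_tags(prefix, current_tags, user_tags, public_keywords,
                 record_text="", limit=10):
    """Return up to `limit` tag suggestions for the given context."""
    prefix_l = prefix.strip().lower()
    present = {t.lower() for t in current_tags}
    text_l = record_text.lower()
    seen = set()
    scored = []
    # The user's own tags are ranked before public keywords.
    for source_rank, source in enumerate((user_tags, public_keywords)):
        for candidate in source:
            key = candidate.lower()
            # Skip empty values, duplicates and tags already entered.
            if not key or key in seen or key in present:
                continue
            # An empty prefix matches everything, so suggestions are
            # available before the user starts typing.
            if not key.startswith(prefix_l):
                continue
            seen.add(key)
            # Prefer candidates that occur in the record text.
            in_record = 0 if key in text_l else 1
            scored.append((in_record, source_rank, key, candidate))
    scored.sort()
    return [candidate for _, _, _, candidate in scored[:limit]]
```

Computing suggestions per request like this avoids the records × users growth of an offline table, at the cost of doing the filtering on every keystroke.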
Moderation
Slashdot has over a million users and, IMO, a very successful moderation system, considering that anyone can register and comment. Some lessons we could learn (from their FAQ and my own experience):
- Don't let anyone have absolute moderation power. I.e., no matter how many moderation points you have, you can only moderate each tag (in our case) once.
- Give a few "editor" users infinite moderation points from the beginning.
- Users who have positive "karma" get moderation points once in a while.
- Karma has an upper bound, to ensure that nobody can accumulate so much that they could start abusing the system without quickly losing it.
- Karma has a lower bound, to encourage "bad" users to improve, since they can't dig themselves too deep a hole.
- Moderation points expire after some period (shorter than the renewal period for the moderation points), to encourage regular logins and use of the moderation system.
- Meta-moderation serves to amplify or dampen the moderation points
- Tokens are allocated periodically to users according to karma, and moderation points are given when the token count goes above a limit (and the token count is reset).
- Meta-moderation is based on the same system, with a higher threshold. Meta-moderation results amplify or dampen / negate the moderation results of the tag and affect the karma of the original moderator in the same way.
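To make the interplay of karma, tokens and moderation points concrete, here is a minimal sketch of the bookkeeping described in the list above. All names and numeric limits are hypothetical, and neither point expiry nor meta-moderation is modelled.

```python
# Hypothetical limits, chosen only for this sketch.
KARMA_MIN, KARMA_MAX = -10, 50
TOKEN_THRESHOLD = 20
POINTS_PER_GRANT = 5


class Moderator:
    def __init__(self, karma=0, is_editor=False):
        self.karma = karma
        self.tokens = 0
        self.points = 0
        self.is_editor = is_editor  # editors have unlimited points
        self.moderated = set()      # tag IDs already moderated

    def change_karma(self, delta):
        """Apply a karma change, clamped to the lower and upper bounds."""
        self.karma = max(KARMA_MIN, min(KARMA_MAX, self.karma + delta))

    def allocate_tokens(self):
        """Periodic job: grant tokens by karma, convert to points."""
        if self.karma > 0:
            self.tokens += self.karma
        if self.tokens > TOKEN_THRESHOLD:
            self.points += POINTS_PER_GRANT
            self.tokens = 0

    def moderate(self, tag_id, scores, vote):
        """Vote on a tag once; return True if the vote was accepted."""
        if tag_id in self.moderated:
            return False  # nobody may moderate the same tag twice
        if not self.is_editor:
            if self.points <= 0:
                return False
            self.points -= 1
        self.moderated.add(tag_id)
        scores[tag_id] = scores.get(tag_id, 0) + vote
        return True
```

Even an editor with unlimited points is stopped by the once-per-tag check, which is what keeps any single user from having absolute moderation power.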
Own ideas:
- Slashdot has the additional limitation that you cannot comment on and moderate comments on the same article, presumably to avoid "personal" votes. I'm not sure this is a good idea for us, since we could lose a good deal of good tagging that way. Of course, we would have to make sure moderators don't get to moderate their own tags. Also, we could have a higher karma limit to be able to moderate on "touched" records.
- Since our database contains a lot of very advanced material, it might be a good idea to get people to (meta-)moderate the fields they know well, based perhaps on the ratings of their tags in the fields.
- Moderators should not be able to see who added the tags to the record
- Maybe we could just jumble together all the tags from all the users on a record in a long list, and let the moderators decide which ones they want to moderate. That way the really good tags would get lots of positive votes, and the really bad ones would get lots of negative votes.
- Should we allow users to moderate records that they are involved in professionally? The danger of disallowing it is that users might try to hide such a link, which would probably be easy. The danger of allowing it is that users might vote according to personal preference instead of professional quality.
- Should moderators be free to pick which tags / articles they moderate, or should we select for them? If we selected, the possibility of personal moderation would be diminished, but I think participation would be lower because we couldn't possibly detect reliably which articles / tags they would want to moderate.
- Moderation points should be scaled down according to the number of tags a user has added, so that spammers cannot earn a good rating from just a few good votes.
- Public tags should be excluded from moderation, to encourage new tags and to avoid giving points for simply confirming what other users and inputters have done. This would also help against harvesting spam bots.
- Table structure - See Slash code repository
Meeting 2009-12-11
Metaphors
- Tagging and baskets are similar metaphors, leading potentially to the same implementation.
- Mostly UI difference.
- Use one or both?
- Expect that older (or more published) users would prefer baskets, and younger (or less published) users tags
- Are metaphors swappable? Create baskets from tags?
- Choice of metaphor can influence adoption per se, the way the feature is used, and its usefulness for classification
- Auto-completion from taxonomy in the UI where people are tagging might be a proxy to get help for classification (could build machine learning on top of it)
Note: 4b=basket.briefcase.bookshelf.backpack
What the user sees in the tagging metaphor
- Text box in the splash page of each paper (or on the list of results). Add a tag, which effectively generates a "basket" of the same name.
- Auto-complete and/or auto-suggest from taxonomy.
- List of relevant tags - Click to add / remove from tags.
- "Null tag" is a possibility to "add to your standard 4b"
What the user sees in the basket metaphor
- "Add to my 4b" button.
- List of basket names - Click to add / remove from basket.
- Page where the add-on is processed with:
- tree of all existing 4b's if more than one, to choose the relevant one
- text box to enter the tags
Meeting 2009-12-16
Results
Cost-benefit analysis of optimizing tagTAG table
Tibor suggested using int (4 bytes, ~4 billion values) instead of bigint (8 bytes, ~18 quintillion values) for the id column and mediumint (3 bytes, ~16 million values) instead of bigint for position to save disk space. If we implement ticket 166, we should change position to whatever type can fit this maximum number. Changing it upwards later is not much of a problem, and since users won't be allowed to go beyond the limit we won't have to worry about the system breaking without warning. For id, changing to int might save 4 bytes per tag, with the risk that anyone with a script could saturate the ID space within minutes or hours.
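As a back-of-the-envelope aid for this cost-benefit discussion, the following sketch computes the capacity of each column type and the raw column storage saved. It assumes unsigned MySQL integer types (as the quoted value counts suggest) and ignores index and row-format overhead; the function names are hypothetical.

```python
# Storage size in bytes of the MySQL integer types discussed above.
TYPE_BYTES = {"mediumint": 3, "int": 4, "bigint": 8}


def distinct_values(type_name):
    """Number of distinct values an unsigned column of this type holds."""
    return 2 ** (8 * TYPE_BYTES[type_name])


def bytes_saved(rows, id_type="int", position_type="mediumint"):
    """Raw column bytes saved versus using bigint for both columns."""
    per_row = ((TYPE_BYTES["bigint"] - TYPE_BYTES[id_type])
               + (TYPE_BYTES["bigint"] - TYPE_BYTES[position_type]))
    return rows * per_row
```

With both changes applied the saving is 9 bytes per row, so a table of one million tags would shrink by about 9 MB of raw column data.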
See also
Trac tickets