WebTag system design

Introduction

The WebTag module should enable users to add free-form tags to records (and maybe other objects). Official task.

Use cases

Enriching the article metadata

According to a survey, the majority of users were willing to add tags to articles "to give a service to the community". These users should be the primary target of the tagging interface, because tags are a powerful classification scheme and enable thematic searches and navigation, which can be more accurate than free-text search.

User goal: Altruistic enriching of metadata.

Our goal: Improve keyword list and search/navigation.

Prerequisites: Simple interface to edit tags.

Tagging for self-organization

Tags can easily be used for self-organization, meaning applying text strings according to some scheme which is relevant to the person, without those strings necessarily being included in our dictionary. These can be completely personal ("to read", "mine"), microformats or other specially formatted tags ("geo:12.34;56.78", "rel:colleague"), new HEP terms ("Unparticle physics", "Quantum string theory"), or anything else that takes their fancy. The key is that it should be possible for a user to show all records tagged with a given tag (like Delicious), along with other related navigation.

Both forms of tagging can result in added value for the entire set of records, since we can look at tags used to detect useful terms, and put those in the dictionary.

User goal: Easier navigation of personally relevant records.

Our goal: Learn from usage / tagging patterns.

Prerequisites: After saving tags, the web page should output each of them as a link to a page showing all records that the current user has added the tag to. The user should have access to an overview of her tags.
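
The lookup behind these prerequisites can be sketched as a small in-memory index. This is only an illustration of the data shape, not the WebTag storage layer: the class name TagIndex and its methods are hypothetical, and a real implementation would presumably live in the database.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical per-user tag index (illustration only)."""

    def __init__(self):
        self._records = defaultdict(set)  # (user_id, tag) -> record IDs
        self._tags = defaultdict(set)     # user_id -> tags used by that user

    def add(self, user_id, record_id, tag):
        """Remember that user_id put tag on record_id."""
        self._records[(user_id, tag)].add(record_id)
        self._tags[user_id].add(tag)

    def records_for(self, user_id, tag):
        """All records the given user has added the tag to (target of a tag link)."""
        return sorted(self._records.get((user_id, tag), ()))

    def overview(self, user_id):
        """All tags the given user has used (the tag overview page)."""
        return sorted(self._tags.get(user_id, ()))
```

Keying on the (user, tag) pair keeps one user's personal tags such as "to read" apart from another user's identically named tags.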

Architecture

  • Primary interface shown in the record's Information (main) tab (cds-invenio/modules/webstyle/lib/webstyle_templates.py).
  • XHTML output.
  • Check out the hidden feature in the regression test suite to test all web pages with the W3C validator.
  • Perhaps use cds-invenio/modules/websearch/lib/search_engine.py to save tags; that way we could set @action='.' + a hidden field action=save_tags, and just return to the page with the results of the save.
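
The form idea in the last bullet could look roughly like the sketch below. It only shows the XHTML that would post back to the same page with the hidden action=save_tags field. The function name render_tag_form, the field name "tags" and the textarea sizes are assumptions, and nothing here is taken from search_engine.py.

```python
from html import escape


def render_tag_form(tags):
    """Return an XHTML form that posts the tags back to the current page.

    Hypothetical helper: the form uses action="." plus a hidden field
    action=save_tags, as suggested in the architecture notes.
    """
    # One tag per line; escape the values so that tags cannot inject markup.
    value = escape("\n".join(tags), quote=True)
    return (
        '<form action="." method="post">\n'
        '  <input type="hidden" name="action" value="save_tags" />\n'
        '  <textarea name="tags" rows="4" cols="40">%s</textarea>\n'
        '  <input type="submit" value="Save tags" />\n'
        '</form>' % value
    )
```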

API

Code

Views

View tags on record

List of tags with public tags highlighted
  • Tags link to list of all records tagged with that value for the current user

Edit tags on record

Form to edit tags
  • Input box should either expand dynamically or use overflow: scroll; so that all of the tags remain visible.

View all tags

Like tags on record, but without filtering on record ID.

View all records tagged with X

Use search interface.

Relation to other modules

  • Plugs into bibclassify web interface.
  • Should make sure that the word "tag" is not used for other things in Invenio and INSPIRE, to avoid confusion.

Whitespace

It is useful to be able to have whitespace in tags, but it could easily lead to confusion: there are many different whitespace characters, and distinguishing between them is not at all trivial on a computer screen. The current code tries to simplify the issue by using the following rules (based on the Python string.whitespace value), applied roughly in this sequence:
  1. Any newline (\r, \n) or combination thereof separates two tags. This allows us to support the three major line-ending conventions: Windows (\r\n), classic Mac OS (\r), and Unix-like systems such as Linux (\n). E.g. "foo\n\r\nbar" (without the quotes) is treated as two tags, "foo" and "bar".
  2. Any other whitespace is translated into Unicode code point U+0020, the standard space character in ASCII and on normal keyboards. E.g., "foo<Tab character>bar" → "foo bar".
  3. Any sequence of more than one space inside a tag is changed to a single space.
  4. Any space at the start or end of a tag is removed. E.g., " foo " → "foo".
  5. Any string which only contains spaces at this point is discarded.
This is of course based on the assumption that different spaces do not add semantic value to the tags in a field like HEP, and are purely useful as word separators. If this assumption is wrong (e.g., if Tab characters are semantically different from spaces in some taxonomy entries), the code will have to be revisited.
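
As an illustration, the five rules can be written down in a few lines of Python. This is a minimal sketch of the rules as described here, not the actual WebTag code. The function name normalize_tags is hypothetical.

```python
import re
import string


def normalize_tags(raw):
    """Split raw input into cleaned tags, following rules 1 to 5."""
    tags = []
    # Rule 1: any run of \r and/or \n characters separates two tags.
    for candidate in re.split(r"[\r\n]+", raw):
        # Rule 2: translate every other whitespace character to U+0020.
        for ws in string.whitespace:
            candidate = candidate.replace(ws, " ")
        # Rule 3: collapse sequences of spaces into a single space.
        candidate = re.sub(r" {2,}", " ", candidate)
        # Rule 4: remove spaces at the start and end of the tag.
        candidate = candidate.strip(" ")
        # Rule 5: discard anything that is empty at this point.
        if candidate:
            tags.append(candidate)
    return tags
```

Note that string.whitespace only covers ASCII whitespace, so characters such as the no-break space (U+00A0) would pass through this sketch unchanged.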

Ideas

Are tags the same as keywords?

"In corpus linguistics a keyword is a word which occurs in a text more often than we would expect to occur by chance alone. Keywords are calculated by carrying out a statistical test (e.g., loglinear) which compares the word frequencies in a text against their expected frequencies derived in a much larger corpus, which acts as a reference for general language use." <http://en.wikipedia.org/wiki/Keyword_(linguistics)>

"An index term, subject term, subject heading, or descriptor, in information retrieval, is a term that captures the essence of the topic of a document. Index terms make up a controlled vocabulary for use in bibliographic records." <http://en.wikipedia.org/wiki/Index_term>

"In online computer systems terminology, a tag is a non-hierarchical keyword or term assigned to a piece of information (such as an internet bookmark, digital image, or computer file). This kind of metadata helps describe an item and allows it to be found again by browsing or searching. Tags are chosen informally and personally by the item's creator or by its viewer, depending on the system." <http://en.wikipedia.org/wiki/Tag_(metadata)>

"key-word, ... in information-retrieval systems, any informative word in the title or text of a document, etc., chosen as indicating the main content of the document" <http://dictionary.oed.com/cgi/entry/50126289/50126289se86?single=1&query_type=word&queryword=keyword&first=1&max_to_show=10&hilite=50126289se86>

tag, n. 8. e. "A character or set of characters appended to an item of data in order to identify it." <http://dictionary.oed.com/cgi/entry/50246105?query_type=word&queryword=tag&first=1&max_to_show=10&sort_type=alpha&result_place=1&search_id=Sbw1-rrtOnU-2364&hilite=50246105>

So tags seem to be for identification, while keywords are important words in the text. Conclusion: the two should be kept separate.

Tags should be without structure. Remember that if taken to the extreme, all metadata could be put into tags, making correct usage extremely complex.

Auto-suggest input

Get keywords using select distinct(value) from bib65x where tag = "6531_a";

Any auto-suggestion feature should be context aware in order to give the most useful answer. It should suggest tags which are relevant for:

  • The user editing the tags
  • The record at hand
  • The tags already entered for the current record
  • The text that the user has already entered as part of the current tag
This leads to an obvious set of inputs to the server whenever the client requests tag suggestions:
  • The user ID
    • Suggestions can include any public keywords and any tags used by the current user
  • The record ID
    • Suggestions could give preference to tags which appear in the record text or metadata
  • The value of the tag input field (or generally, fields, but this is not applicable for the current solution)
    • Only tags which are not already present in full in the input field should be suggested
  • The position of the cursor in the tag input field (if applicable)
    • Suggestions should be available before starting to type, to familiarize the user with the interface and typical tags, and to speed up the tagging
      • Could be generated offline, but would scale with records × users
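
The inputs listed above map onto a suggestion function along the following lines. This is a sketch under several assumptions: the name suggest_tags and all parameter names are hypothetical, matching is a plain case-insensitive prefix match, and "appears in the record text" is a simple substring test. The cursor position is ignored for brevity.

```python
def suggest_tags(prefix, entered_tags, user_tags, public_keywords,
                 record_text, limit=10):
    """Return candidate tags for the text typed so far (hypothetical helper)."""
    # Candidates: any public keyword plus any tag used by the current user.
    candidates = set(public_keywords) | set(user_tags)
    # Tags already present in full in the input field are not suggested again.
    already = {tag.lower() for tag in entered_tags}
    needle = prefix.strip().lower()
    text = record_text.lower()
    matches = [
        tag for tag in candidates
        if tag.lower() not in already and tag.lower().startswith(needle)
    ]
    # Give preference to tags that appear in the record text or metadata,
    # then fall back to alphabetical order.
    matches.sort(key=lambda tag: (tag.lower() not in text, tag.lower()))
    return matches[:limit]
```

An empty prefix matches every candidate, so suggestions are available before the user starts to type.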

Moderation

Slashdot has over a million users, and an IMO very successful moderation system, considering that anyone can register and comment. Some lessons we could learn (from their FAQ and my own experience):
  • Don't let anyone have absolute moderation power. I.e., no matter how many moderation points you have, you can only moderate each tag (in our case) once.
  • Give a few "editor" users infinite moderation points from the beginning.
  • Users who have positive "karma" get moderation points once in a while
  • Karma is bounded above, so that nobody can climb so high that they could start abusing the system without quickly losing it.
  • Karma is bounded below to encourage "bad" users to improve, since they cannot dig themselves into too deep a hole.
  • Moderation points expire after some period (shorter than the renewal period for the moderation points), to encourage regular logins and use of the moderation system.
  • Meta-moderation serves to amplify or dampen the moderation points
  • Tokens are allocated periodically to users according to karma, and moderation points are given when the token count goes above a limit (and the token count is reset).
  • Meta-moderation is based on the same system, with a higher threshold. Meta-moderation results amplify or dampen / negate the moderation results of the tag and affects the karma of the original moderator in the same way.
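
The karma bounds and the token mechanism from the list above can be sketched as follows. All numbers (bounds, threshold, points per grant) and all names are assumptions made for illustration. They are not taken from the Slash code. Expiry of points and meta-moderation are left out.

```python
from dataclasses import dataclass

# Assumed values, for illustration only.
KARMA_MIN, KARMA_MAX = -10, 50
TOKEN_THRESHOLD = 100
POINTS_PER_GRANT = 5


@dataclass
class Moderator:
    karma: int = 0
    tokens: int = 0
    points: int = 0


def adjust_karma(user, delta):
    """Change karma, keeping it inside the lower and upper bounds."""
    user.karma = max(KARMA_MIN, min(KARMA_MAX, user.karma + delta))


def allocate_tokens(user):
    """Periodic job: hand out tokens according to karma.

    Once the token count goes above the threshold, the user receives
    moderation points and the token count is reset.
    """
    if user.karma > 0:
        user.tokens += user.karma
    if user.tokens > TOKEN_THRESHOLD:
        user.points += POINTS_PER_GRANT
        user.tokens = 0
```

With these assumed values, a user with karma 40 would receive moderation points on every third run of the periodic job, while a user with zero or negative karma would never accumulate tokens.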

Own ideas:

  • Slashdot has the additional limitation that you cannot comment on and moderate comments on the same article, presumably to avoid "personal" votes. I'm not sure this is a good idea for us, since we could lose a good deal of good tagging that way. Of course, we would have to make sure moderators don't get to moderate their own tags. Also, we could have a higher karma limit to be able to moderate on "touched" records.
  • Since our database contains a lot of very advanced material, it might be a good idea to get people to (meta-)moderate the fields they know well, based perhaps on the ratings of their tags in the fields.
  • Moderators should not be able to see who added the tags to the record
    • Maybe we could just jumble together all the tags from all the users on a record in a long list, and let the moderators decide which ones they want to moderate. That way the really good tags would get lots of positive votes, and the really bad ones would get lots of negative votes.
  • Should we allow users to moderate records that they are involved in professionally? The danger of disallowing it is that users might try to hide such a link, which would probably be easy. The danger of allowing it is that users might vote according to personal preference instead of professional quality.
  • Should moderators be free to pick which tags / articles they moderate, or should we select for them? If we selected, the possibility of personal moderation would be diminished, but I think participation would be lower because we couldn't possibly detect reliably which articles / tags they would want to moderate.
  • Moderation points should be reduced according to the number of tags, to stop spammers who collect a few good votes from obtaining a good rating.
  • Public tags should be excluded from moderation, to encourage new tags and to avoid giving points for simply confirming what other users and inputters have done. This would also help against harvesting spam bots.
  • Table structure - See Slash code repository

Meeting 2009-12-11

Metaphors

  • Tagging and baskets are similar metaphors, leading potentially to the same implementation.
  • Mostly UI difference.
  • Use one or both?
    • Expect that older (or more published) users would prefer baskets, and younger (or less published) users tags
    • Are metaphors swappable? Create baskets from tags?
    • Choice of metaphor can influence adoption per se, the way it is used, and usefulness for classification
  • Auto-completion from taxonomy in the UI where people are tagging might be a proxy to get help for classification (could build machine learning on top of it)

Note: 4b=basket.briefcase.bookshelf.backpack

What the user sees in the tagging metaphor

  • Text box in the splash page of each paper (or on the list of results). Add a tag, which effectively generates a "basket" of the same name.
  • Auto-complete and/or auto-suggest from taxonomy.
  • List of relevant tags - Click to add / remove from tags.
  • "Null tag" is a possibility to "add to your standard 4b"

What the user sees in the basket metaphor

  • "Add to my 4b" button.
  • List of basket names - Click to add / remove from basket.
  • Page where the add-on is processed with:
    • tree of all existing 4b's if more than one, to choose the relevant one
    • text box to enter the tags

Meeting 2009-12-16

Results

Cost-benefit analysis of optimizing tagTAG table

Tibor suggested using int (4 bytes, ~4 billion values) instead of bigint (8 bytes, ~18 quintillion values) for the id column, and mediumint (3 bytes, ~16 million values) instead of bigint for position, to save disk space. If we implement ticket 166, we should change position to whatever type can fit that maximum number. Changing it upwards later is not much of a problem, and since users won't be allowed to go beyond the limit, we won't have to worry about the system breaking without warning. For id, switching to int might save 4 bytes per tag, with the risk that anyone with a script could saturate the ID space within minutes or hours.
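
The arithmetic behind this trade-off can be checked with a short calculation. It is a rough sketch: the value counts assume unsigned columns (which matches the figures quoted above), index overhead is ignored, and the insert rate is a free parameter rather than a measured number. The helper names are hypothetical.

```python
# Column sizes in bytes, as quoted in the meeting notes.
COLUMN_BYTES = {"bigint": 8, "int": 4, "mediumint": 3}
# Number of distinct values, assuming unsigned columns.
CAPACITY = {name: 2 ** (8 * size) for name, size in COLUMN_BYTES.items()}


def bytes_saved(rows):
    """Raw column bytes saved by using int for id and mediumint for position."""
    id_saving = COLUMN_BYTES["bigint"] - COLUMN_BYTES["int"]
    position_saving = COLUMN_BYTES["bigint"] - COLUMN_BYTES["mediumint"]
    return rows * (id_saving + position_saving)


def hours_to_exhaust(column_type, inserts_per_second):
    """Hours until the value space is used up at an assumed insert rate."""
    return CAPACITY[column_type] / inserts_per_second / 3600.0
```

For example, bytes_saved(1000000) gives 9000000, i.e. about 9 MB of column data per million tags. How quickly the int ID space fills up depends entirely on the insert rate a script can sustain, which is why the rate is left as a parameter.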

See also

Trac tickets
Topic attachments
  • edit_interface_mockup.jpg (144.8 K, 2010-02-22 15:48)
  • unresolved_questions.pdf (85.1 K, 2009-12-16 14:33): Resolved during the meeting
  • view_interface_mockup.jpg (128.4 K, 2010-02-22 15:48)
Topic revision: r13 - 2010-07-14 - unknown
 