Tuesday, December 20, 2005
Anjo Anjewierden: What is a Topic?
Because it is the holiday season, Anjo Anjewierden asks: what is the topic? Now anybody who has ever attended a party (which are seasonally superabundant at the moment) and casually joined a conversation knows that it may take a while to actually find out. One reason is of course that some conversations (especially at parties) serve no other purpose than giving the participants the idea that they belong to the same group or, like the more flirtatious ones, that the participants want to raise the level of intimacy. Thus, if you are an outsider to the group, you may come to the conclusion that people are discussing work, when what they are really doing is building on their shared experience to stress their commonality and shared interests. To some extent that seems to be the case for blogs as well, especially if they address a community that is interacting to some degree. However, I at least would already be quite happy if we could detect the more superficial topic automatically.
Lilia equates the topic with the tag you would put on a document like a blog. She has a point, but I would think that a tag's primary purpose is to make it possible to find something back by generating enough, and sufficiently general, associations (mirroring the precision/recall dichotomy in information retrieval). On the other hand, when people are asked "what are you talking about?", I think they will try to give a characterisation that is sufficiently precise in a given context. In a sense this may explain why Anjo is successful with a simple information retrieval measure like TF/IDF, which is after all designed to find terms that are distinctive in a document compared to other documents (the context!). Without having read the full conversation, a word like Blogwalk, Sigmund or Skype may be quite enough as a characterisation. Lilia, on the other hand, has been engaged in the conversations and so has more of a context. She will therefore subconsciously remember similar discussions and change her context correspondingly.
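To make the TF/IDF idea concrete, here is a minimal sketch in Python. It is not Anjo's implementation: the function name `tf_idf`, the smoothed IDF formula and the toy documents are all assumptions made up for illustration. It only shows that a term scores high when it is frequent in the document but rare in the background.

```python
import math
from collections import Counter

def tf_idf(document, background):
    """Score each term of `document` against a list of background documents.

    Documents are lists of lowercase tokens. The smoothed IDF used here is
    one common variant, chosen for this sketch only.
    """
    counts = Counter(document)
    n_docs = len(background)
    scores = {}
    for term, count in counts.items():
        tf = count / len(document)                        # frequency in this document
        df = sum(1 for doc in background if term in doc)  # background documents with the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1       # rare in the background => larger
        scores[term] = tf * idf
    return scores

# Toy data, invented for illustration.
conversation = "skype presence skype presence tool the the the".split()
background = [
    "the weather is nice".split(),
    "the party was fun".split(),
    "the blog has a new post".split(),
]

scores = tf_idf(conversation, background)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy example "the" is the most frequent word in the conversation, yet "skype" and "presence" score higher because they do not occur in the background at all.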
Now what do I mean by similar discussions? I wish I knew exactly, but let's take the discussion on Skype and presence as an example. Both terms have a high frequency and a high TF/IDF score in this discussion. In decreasing order of TF/IDF score Anjo finds presence, Skype, communication, IM (= Instant Messaging, I guess) and communication tool. Knowing Lilia, I know that she knows that Skype is a communication tool that supports instant messaging and presence, and that presence is the capability to tell whether you are available for communication (so there you have ontological relations).
Now what does it mean that Lilia remembers similar discussions? I don't know exactly of course, because I don't really know everything she has read, and moreover the human brain is subtle. But let us suppose that, using her ontological knowledge, she will first "score" a hit for communication tool for every mention of Skype, and score a "bit of a hit" for presence and quite a bit for communication (because this subject is dear to her heart). This puts communication tool, presence and Instant Messaging higher up the list. Again knowing Lilia, I know that she has read, blogged and talked about these subjects. Thus I think that, consciously or subconsciously, she changed her context (if you want to think in IR terms, changed the document collection in which to compute TF/IDF scores) and asked herself what it is that characterises this discussion in this more specialised context. In this context (and in her social and blog circle) Skype, IM and presence are not so characteristic anymore, so my guess (not having read the discussion) is that she comes up with a characterisation like
presence with skype
or likely more specific characterisations such as
How do you switch on presence in skype
or
presence in skype is really great/awful/mediocre
or
skype now does presence too !
or
......
It is thus not so surprising that she wants a Sigmund-type cooccurrence analysis of the conversation, because cooccurrence tends to emphasize the relations between terms rather than just the terms themselves, although, being a statistical method, it cannot really see what kind of semantic relations may underlie the observed cooccurrences.
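As a rough illustration of what a cooccurrence count looks like (I do not know how Sigmund actually computes it, so the function `cooccurrences`, the choice of one post as the cooccurrence window and the toy posts are all assumptions), one can simply count how often two terms appear in the same post:

```python
from collections import Counter
from itertools import combinations

def cooccurrences(posts):
    """Count, for every pair of terms, the number of posts containing both."""
    pairs = Counter()
    for post in posts:
        # Sort the distinct terms so each pair has a single canonical order.
        for a, b in combinations(sorted(set(post)), 2):
            pairs[(a, b)] += 1
    return pairs

# Toy posts, invented for illustration.
posts = [
    "skype presence".split(),
    "skype presence im".split(),
    "skype call".split(),
]

pairs = cooccurrences(posts)
print(pairs.most_common(1))  # [(('presence', 'skype'), 2)]
```

The count says that presence and skype go together, but nothing about why: whether Skype supports presence, lacks it, or does it badly is invisible to the statistics.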
Now I hope I know Lilia well enough that my belief that she does not mind me blogging about what she thinks is true :-).
Comments:
© Copyright 2004-2006 Rogier Brussee.
Phew !
Clearly your context is much richer than a set of documents, because you have a life. You have sensory experiences and you talk to people. That extra context is hard to feed into tools like toko :), but using some ontological information is a (rather poor) substitute. Indeed you are right that in information retrieval the document set you are looking at serves as the context. In particular the TF/IDF scores depend on this "background" document set. Therefore, if you want to play information retrieval tricks that mimic the re-evaluation of the topic of a subject, based on interpreting a subset of documents (such as a conversation) in a more focussed context, you are stuck with doing a query on a document database and redefining the result set as the document set you compute TF/IDF scores against.
Of course you have to make the right query to get a useful more focussed context, so you first have to make an initial guess about what the document is about by judging the relevance of terms against an unfocussed background of documents.
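The two steps can be sketched as follows. This is only an illustration under assumptions: the helper names `tf_idf`, `top_terms` and `refocus`, the smoothed IDF, the "query" (keep every document that mentions one of the guessed terms) and the toy collection are all invented here, and a real document database would do the query differently.

```python
import math
from collections import Counter

def tf_idf(document, background):
    """TF/IDF of each term in `document` against `background` (smoothed IDF)."""
    counts = Counter(document)
    n = len(background)
    scores = {}
    for term, count in counts.items():
        df = sum(1 for doc in background if term in doc)
        scores[term] = (count / len(document)) * (math.log((1 + n) / (1 + df)) + 1)
    return scores

def top_terms(document, background, k):
    """The k best-scoring terms; ties are broken alphabetically."""
    scores = tf_idf(document, background)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

def refocus(document, collection, k=2):
    # Step 1: initial guess against the unfocussed background.
    guess = top_terms(document, collection, k)
    # Step 2: "query": keep the documents that mention a guessed term.
    focused = [doc for doc in collection if any(t in doc for t in guess)]
    # Step 3: recompute the scores against the result set.
    return guess, top_terms(document, focused, k)

# Toy data, invented for illustration.
conversation = ["skype"] * 4 + ["presence"] * 4 + ["switch"] * 2
collection = [
    "skype presence call quality".split(),
    "skype presence in im tools".split(),
    "skype presence and im".split(),
    "holiday party plans".split(),
    "weather report today".split(),
    "new blog post today".split(),
    "cooking for the party".split(),
    "train delays today".split(),
]

guess, focused_terms = refocus(conversation, collection)
print(guess)          # ['presence', 'skype']
print(focused_terms)  # ['switch', 'presence']
```

Against the whole collection the conversation is "about" presence and skype. Against the documents that already mention those terms, they stop being distinctive and the more specific "switch" comes out on top.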
Come to think of it, I think we all have the experience that we get to see less and less of the picture as we get to know more. Apparently our own mental focussing mechanism can easily be tricked into bringing in ever more irrelevant detail (often from our newly acquired knowledge), which makes it harder and harder to discern the really important topic that an outsider, not hindered by knowledge, might spot immediately.
These are my personal views and do not necessarily reflect those of my employer.