Sunday, September 27, 2009

Blogosphere research issues

This post summarizes the research issues section of Nitin Agarwal and Huan Liu, “Blogosphere: Research Issues, Tools, and Applications”, in the Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.

1- Modeling the Blogosphere

This issue concerns the models that best describe the structure and properties of the blogosphere, with the aim of gaining deeper insight into the relationships between bloggers, commenters, blog posts, comments, viewers/readers, and different blog sites. Modeling the blogosphere is often associated with modeling the web. Researchers represent the web as a webgraph, where each web page is a node and the hyperlinks between pages are edges. This representation results in a directed cyclic graph, and weights can be associated with the edges. This graph model of the web has been extensively exploited. More about the web graph can be found in [1], and I might talk about it in a later post. So, why don’t we model the blogosphere the same way?
First, models developed for the web assume a dense graph structure, because web pages are connected by a large number of hyperlinks. This assumption does not hold in the blogosphere, where the hyperlink structure is very sparse. Second, the level of interaction, in the form of comments and replies to a blog post, makes the blogosphere different from the web. Third, the highly dynamic and “short-lived” nature of blog posts cannot be simulated by web models. Web models do not account for this dynamism; they assume that web pages accumulate links over time. In a blog network, where blog posts are the nodes, it is impractical to construct a static graph like the one used for the web. These differences call for a model tailored to the characteristics of the blogosphere.
This has motivated researchers to come up with models specific to the blogosphere.
For example, Leskovec et al. [2] studied temporal patterns of the blogosphere, such as how often people create blog posts, burstiness and popularity, how blog posts are linked, and how dense those links are. They reported that these phenomena follow power-law distributions, and they constructed a blog network.
Kumar et al. [3] use the blogrolls given on a blog post to create a network of connected posts, under the assumption that blogrolls link to related or similar blog posts.
Much of the research posits a known network structure of the blogosphere in order to model a particular problem domain, so the resulting models are specific to that domain.
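
To make the webgraph representation and the sparsity argument above a bit more concrete, here is a minimal sketch in Python. The class name WebGraph, the density measure (edges divided by the number of possible directed edges), and the toy link lists are my own assumptions for illustration; they are not taken from the paper or from [1].

```python
from collections import defaultdict


class WebGraph:
    """Directed graph: pages (or posts) are nodes, hyperlinks are weighted edges."""

    def __init__(self):
        self.nodes = set()
        # out_links[source][target] = weight of the hyperlink source -> target
        self.out_links = defaultdict(dict)

    def add_link(self, source, target, weight=1.0):
        self.nodes.update((source, target))
        self.out_links[source][target] = weight

    def edge_count(self):
        return sum(len(targets) for targets in self.out_links.values())

    def density(self):
        """Fraction of all possible directed edges that are present."""
        n = len(self.nodes)
        if n < 2:
            return 0.0
        return self.edge_count() / (n * (n - 1))


# Hypothetical toy data: a small, well-linked set of web pages ...
web = WebGraph()
for src, dst in [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]:
    web.add_link(src, dst)

# ... and a set of blog posts that rarely link to each other.
blogs = WebGraph()
blogs.add_link("post1", "post2")
blogs.add_link("post3", "post4")

print("web density:", web.density())
print("blog density:", blogs.density())
```

On this toy data the blog graph comes out much sparser than the web graph, which is the property that makes web models a poor fit. A real study would of course measure this on crawled data.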

2- Blog Clustering

A lot of research is going on to automatically cluster blogs into meaningful groups, so that readers can focus on the categories that interest them rather than picking relevant blogs out of the jungle. Blog sites often allow their users to tag blog posts. This human-labeled tag information forms the so-called “folksonomy”.
Brooks and Montanez [4] presented a study showing that human-labeled tags are good for classifying blog posts into broad categories, but less effective at indicating the particular content of a blog post. They used the tf-idf measure to pick the three highest-scoring words in every blog post, computed the pairwise similarity among all the blog posts, and clustered them. They compared the results with the clustering obtained using the human-labeled tags and reported significant improvement. It was also found that keyword-based clustering suffers from high dimensionality and sparsity. Agarwal et al. [5] proposed WisClus, which uses the collective wisdom of the bloggers to cluster the blogs. They use the blog categories to construct a category relation graph, merge similar categories, and cluster the blogs that belong to them. Nodes in this graph are categories, and edges represent the similarity between them. Their results showed that WisClus is better than keyword-based clustering.
I am going to talk about WisClus in detail in a later post.
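
The keyword-based step described above (pick the top three tf-idf words per post, then compare posts pairwise) can be sketched roughly as follows. This is only an illustration under my own assumptions: the tf-idf variant (term frequency times the log of inverse document frequency), the cosine similarity over keyword sets, the function names, and the three toy posts are all hypothetical, and this is not the code used in [4].

```python
import math
from collections import Counter


def top_keywords(docs, k=3):
    """For each tokenized, non-empty document, return its k highest tf-idf terms."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(set(ranked[:k]))
    return result


def keyword_similarity(a, b):
    """Cosine similarity between two keyword sets (binary term vectors)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical toy posts, already tokenized by whitespace.
posts = [
    "python code python tutorial code examples".split(),
    "python code snippets tutorial python".split(),
    "recipe pasta recipe tomato pasta dinner".split(),
]

keywords = top_keywords(posts)
for i in range(len(posts)):
    for j in range(i + 1, len(posts)):
        print(i, j, keyword_similarity(keywords[i], keywords[j]))
```

The resulting pairwise similarities could then be fed to any clustering algorithm. With only three keywords per post the vectors are very sparse, which hints at the sparsity problem mentioned above.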

3- Blog Mining

Blogs are immensely valuable resources for tracking consumers' beliefs and opinions, gauging the initial reaction to a launch, understanding consumer language, tracking trends and buzzwords, and fine-tuning information needs. Blog conversations leave behind trails of links, which are useful for understanding how information flows and how opinions are shaped and influenced. Tracking blogs also helps in gaining deeper insights, because bloggers share their views from various perspectives and so give a 'context' to the information collected.
A prototype system called Pulse [19] uses a Naïve Bayes classifier trained on sentences manually annotated with positive/negative sentiment, and iterates until all unlabeled data is adequately classified. Another system, presented in [5], improves blog retrieval by using opinionated words acquired from WordNet in the proximity of the query. More about WordNet and Naïve Bayes in later posts.
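
As a rough idea of what a Naïve Bayes sentiment classifier trained on annotated sentences looks like, here is a minimal sketch. It is not the Pulse implementation: the class name, the whitespace tokenization, the add-one (Laplace) smoothing, and the four training sentences are my own assumptions, and the iterative step over unlabeled data is left out.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # sentences seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()

    def train(self, labeled_sentences):
        for text, label in labeled_sentences:
            words = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # log P(label) + sum of log P(word | label), smoothed
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical manually annotated sentences.
clf = NaiveBayesSentiment()
clf.train([
    ("great phone love the screen", "pos"),
    ("battery life is excellent", "pos"),
    ("terrible battery very disappointing", "neg"),
    ("screen broke awful quality", "neg"),
])

print(clf.predict("love the excellent screen"))
print(clf.predict("awful disappointing quality"))
```

A bootstrapping loop in the spirit of the description above would repeatedly classify unlabeled sentences, add the confident ones to the training set, and retrain. How Pulse decides that the data is adequately classified is not covered here.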

4- Community Discovery and Factorization

One method that researchers commonly use to identify communities in the blogosphere is content and text analysis of the blog posts. An alternative approach, used to identify communities on the web, is based on hubs and authorities: it clusters the expert communities together by identifying them as authorities. More about hubs and authorities: http://www.urlanalysis.info/hubs.asp. Chin and Chignell [6] proposed a model for finding communities that takes the blogging behavior of bloggers into account, aligning behavioral approaches to studying community with network and link analysis approaches. I am going to talk about this paper in detail later.
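
The hub and authority idea mentioned above is usually computed iteratively: a page is a good authority if good hubs link to it, and a good hub if it links to good authorities. Below is a minimal sketch of that iteration. The function name, the fixed iteration count, the normalization, and the toy link structure are my own assumptions, and this is not the community discovery method of any paper cited here.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a dict of source -> list of targets."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hubs = {n: 1.0 for n in nodes}
    auths = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages linking to each node.
        auths = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                auths[dst] += hubs[src]
        # Hub update: sum the authority scores of the pages each node links to.
        hubs = {n: sum(auths[d] for d in links.get(n, ())) for n in nodes}
        # Normalize both score vectors to unit length so values stay bounded.
        for scores in (auths, hubs):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hubs, auths


# Hypothetical toy links: three blogs pointing at two "expert" pages.
links = {
    "blogA": ["expert1", "expert2"],
    "blogB": ["expert1", "expert2"],
    "blogC": ["expert1"],
}
hubs, auths = hits(links)
print(sorted(auths, key=auths.get, reverse=True))
print(sorted(hubs, key=hubs.get, reverse=True))
```

Pages with high authority scores are the candidates that such an approach would group together as an expert community.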

5- Filtering Spam Blogs (a.k.a. splogs)

Besides degrading the quality of search results, splogs also waste network resources. Some researchers consider spam blog detection to be a special case of web spam detection, but there are some critical differences between the two. The content on blog sites is very dynamic compared to that of web pages, so content-based spam filters are ineffective. Moreover, spammers can copy content from regular blog posts to evade content-based spam filters. Link-based spam filters can easily be beaten by creating links that point to the splogs.


Two more research issues were presented: influence and propagation in blogs, and trust and reputation. They are not currently essential to our work.

References


[1] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins, and Eli Upfal. The web as a graph. In Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2000.



[2] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Cascading behavior in large blog graphs. In SIAM International Conference on Data Mining, 2007.



[3] Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins. On the Bursty Evolution of Blogspace. In Proceedings of the 12th international conference on World Wide Web, pages 568-576, New York, NY, USA, 2003. ACM Press.



[4] Christopher H. Brooks and Nancy Montanez. Improved annotation of the blogosphere via autotagging and hierarchical clustering. In WWW '06: Proceedings of the 15th international conference on World Wide Web, pages 625-632, New York, NY, USA, 2006. ACM Press.



[5] Nitin Agarwal, Magdiel Galan, Huan Liu, and Shankar Subramanya. Clustering blogs with collective wisdom. In Proceedings of the International Conference on Web Engineering, 2008.



[6] Alvin Chin and Mark Chignell. A social hypertext model for finding community in blogs. In HYPERTEXT '06: Proceedings of the seventeenth conference on Hypertext and hypermedia, pages 11-22, New York, NY, USA, 2006. ACM Press.
