Monday, September 28, 2009

Modeling bloggers' interests

Modeling bloggers' interests has become an important blog research issue, since interest models are the basis for constructing user profiles. Most researchers focus on a single feature when modeling bloggers' interests, for example a combined classifier that identifies interests from individual posts [1], or the use of social hypertext [2]. However, many features can be used to detect interests, such as textual, temporal, and interactive features [3]. Other features, such as comments and bloggers' communities, could also help identify bloggers' interests, but they are not covered in this post.

Two interest models are introduced: STIM (short-term interest model) and LTIM (long-term interest model). STIM handles short-term interests, which have weak stability, while LTIM handles interests that persist over long periods and are strongly stable. Both models combine textual and temporal features.

Each model depends on a time window, i.e., the set of posts published in a certain period, which is used to update the original interest vector. A forgetting function can be used to calculate the coefficients that determine how the posts in the time window are weighted [4].
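To make the idea concrete, here is a minimal Python sketch of how such a scheme could work; the exponential form of the forgetting function and the half_life parameter are my own assumptions for illustration, not the exact formulation of [4].

```python
import math

def forgetting_weight(age_days, half_life=30.0):
    """Exponential forgetting function: a post's influence decays with
    its age. half_life (in days) is an assumed tuning parameter."""
    return math.exp(-math.log(2) * age_days / half_life)

def update_interest_vector(interests, window_posts, half_life=30.0):
    """Fold the topic vectors of the posts in the time window into the
    existing interest vector, weighting recent posts more heavily.
    `interests` and each post's topic vector map term -> weight."""
    updated = dict(interests)
    for age_days, topics in window_posts:
        w = forgetting_weight(age_days, half_life)
        for term, score in topics.items():
            updated[term] = updated.get(term, 0.0) + w * score
    return updated

# A 2-day-old post reinforces "python" far more than a 45-day-old
# post reinforces "travel".
interests = {"python": 0.8, "travel": 0.3}
window = [(2, {"python": 1.0}), (45, {"travel": 1.0})]
print(update_interest_vector(interests, window))
```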

References:

[1] Xiaochuan Ni, et al. “Automatic Identification of Chinese Weblogger Interests Based on Text Classification”. In WI '06.

[2] Alvin Chin, et al. “A Social Hypertext Model for Finding Community in Blogs”. In HT '06.

[3] Chun-Yuan Teng, et al. “Detection of Bloggers' Interests: Using Textual, Temporal, and Interactive Features”. In WI '06.

[4] Y. Cheng, G. Qiu, J. Bu, K. Liu, Y. Han, C. Wang and C. Chen. “Model Bloggers' Interests Based on Forgetting Mechanism”. In WWW 2008.

How to use comments as a measure for understanding blog posts

Most proposed research on the blogosphere focuses only on blog posts and ignores their comments. Yet comments can be used as a measure for understanding their corresponding posts.

Many applications can benefit from comments-oriented summarization, such as blog search, blog representation, and reader feedback. Among blog posts that contain comments, an average of 6.3 comments per post was observed. Nevertheless, very few studies on blog comments and post summarization have been reported.

One of the described models extracts the sentences from a blog post that best represent the topics discussed in its comments. It first derives representative words from the comments and then selects sentences containing those words. Several word-representativeness measures can be used, such as binary, comment frequency, term frequency, and the ReQuT model, where ReQuT scores words along three aspects: Reader, Quotation, and Topic.

Three observations are used as guidelines for the ReQuT model:
• Observation 1: many readers mention other readers' names as a reply to them.
• Observation 2: a comment may contain quoted sentences from other comments.
• Observation 3: discussions in comments often branch into several topics, and a set of comments are linked together by sharing the same topic.

The simpler representativeness measures can be defined as follows (a small code sketch follows the list):
• Binary: whether the word appears in at least one comment.
• Comment frequency: the number of comments containing the word (similar to document frequency).
• Term frequency: the number of occurrences of the word across all comments of the blog post.
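Here is a small sketch of these three measures, assuming each comment has already been tokenized into a list of words:

```python
from collections import Counter

def representativeness(comments):
    """Compute the three simple word-representativeness measures over
    the comments of a single blog post (`comments` is a list of token
    lists, one per comment)."""
    term_freq = Counter()     # occurrences across all comments
    comment_freq = Counter()  # number of comments containing the word
    for tokens in comments:
        term_freq.update(tokens)
        comment_freq.update(set(tokens))  # count each comment once
    binary = {w: 1 for w in term_freq}    # appears in at least one comment
    return binary, comment_freq, term_freq

comments = [["great", "post"], ["great", "analysis", "great"]]
binary, cf, tf = representativeness(comments)
print(cf["great"], tf["great"])  # 2 comments contain it; 3 occurrences
```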

All three of these measures, however, are vulnerable to spam comments. Other signals, such as the authors of comments and the quotations among comments, can be used to compensate; these correspond to the Reader and Quotation aspects of ReQuT.

An equation combining the previous measures with corresponding weights is derived to calculate each word's representativeness score. Given these scores and human-labeled sentences, two sentence-selection methods are applied and evaluated.
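The paper's two selection methods are not reproduced here, but the following illustrative sketch conveys the general idea: score each sentence of the post by the density of representative words it contains, then keep the top-k sentences as the comments-oriented summary.

```python
def select_sentences(sentences, word_scores, k=3):
    """Rank tokenized post sentences by the density of representative
    words they contain and return the top-k (illustrative only; not
    the exact selection methods evaluated in the paper)."""
    def density(tokens):
        if not tokens:
            return 0.0
        return sum(word_scores.get(w, 0.0) for w in tokens) / len(tokens)
    return sorted(sentences, key=density, reverse=True)[:k]

summary = select_sentences([["the", "method", "works", "well"],
                            ["an", "unrelated", "aside"]],
                           {"method": 1.0, "works": 0.5}, k=1)
print(summary)
```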

References:

M. Hu, A. Sun and E. Lim. Comments-Oriented Blog Summarization by Sentence Extraction. In CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, 2007.

Sunday, September 27, 2009

Graph Visualization

This post summarizes Chapter 3, "Graph Visualization and Data Mining", from Diane J. Cook and Lawrence B. Holder's "Mining Graph Data".

A graph drawing algorithm receives as input a combinatorial description of a graph and returns as output a drawing of that graph. The edges are represented as polygonal chains in a polyline drawing, as chains of alternating horizontal and vertical segments in an orthogonal drawing, and as straight segments in a straight-line drawing. One of the fundamental properties of a drawing is its readability, that is, its capability of conveying the information associated with the graph in a clear way. The benefits of graph drawing techniques for data mining include analysis through visual exploration, discovery of patterns and correlations, and abstraction and summarization.

Social network analysts use graph-theoretic methodologies to identify important actors, crucial links, subgroups, roles, and network characteristics in such graphs. In this context, graph visualization is essential to make complicated types of analysis and data handling, such as the identification of recurrent patterns and relationships, more transparent, intuitive, and readily accessible. For example, the centrality of an element in a collaboration network can be more easily detected by visual inspection of a radial drawing of the network, that is, a drawing where the vertices lie on concentric circles.

Web searching is another application area: developing effective technologies for mining data from the World Wide Web is a very fertile research field, and the effectiveness and efficiency of most common search engines rely on the structural properties of the Web graph (Web resources and their hyperlinks). Automatic visualization systems can provide valuable insight into the link structure and the page rank of the Web graph. Also, a new generation of Web search engines is being designed, where the data collected from the World Wide Web are clustered into meaningful semantic categories. The user interacts with the Web data through a graphical user interface that displays these categories organized as a tree or as a graph.

Three of the most popular graph drawing approaches are the force-directed, the hierarchical, and the topology-shape-metrics approaches:

Force-directed methods are quite popular for network analysis since they are easy to implement and often produce aesthetically pleasing results. The basic idea is to model a graph G as a mechanical system where the vertices of G are steel rings and the edges of G are springs connecting pairs of rings. To compute a drawing of G, the rings are placed in some initial layout and let go, until the spring forces move the system to a local minimum of energy.
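A minimal spring-embedder sketch of this idea follows; the force laws and cooling schedule are classic Fruchterman-Reingold-style choices, and the constants are assumptions for illustration.

```python
import math, random

def force_directed_layout(nodes, edges, iterations=200, k=1.0):
    """Model the graph as rings connected by springs: edges attract
    their endpoints, all pairs of vertices repel, and the system is
    released until it settles near a local minimum of energy."""
    pos = {v: [random.random(), random.random()] for v in nodes}
    for step in range(iterations):
        disp = {v: [0.0, 0.0] for v in nodes}
        # Repulsion between every pair of vertices.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                d = math.hypot(dx, dy) or 1e-6
                f = k * k / d
                disp[u][0] += f * dx / d; disp[u][1] += f * dy / d
                disp[v][0] -= f * dx / d; disp[v][1] -= f * dy / d
        # Spring attraction along every edge.
        for u, v in edges:
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-6
            f = d * d / k
            disp[u][0] -= f * dx / d; disp[u][1] -= f * dy / d
            disp[v][0] += f * dx / d; disp[v][1] += f * dy / d
        # Cooling: cap each move, shrinking the cap over time.
        t = 0.1 * (1 - step / iterations)
        for v in nodes:
            dx, dy = disp[v]
            d = math.hypot(dx, dy) or 1e-6
            pos[v][0] += dx / d * min(d, t)
            pos[v][1] += dy / d * min(d, t)
    return pos

print(force_directed_layout(["a", "b", "c"], [("a", "b"), ("b", "c")]))
```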

Hierarchies arise in a variety of applications, for example Project Evaluation Review Technique (PERT) diagrams in project planning, class hierarchies in software engineering, and is-a relationships in knowledge representation systems. It is customary to represent hierarchical graphs so that all the edges flow in the same direction, for example from top to bottom or from left to right. Algorithms that compute drawings of hierarchies generally follow a methodology first introduced by Sugiyama et al. It accepts as input a directed graph G without any particular restriction (G can be planar or not, acyclic or cyclic) and produces as output a layered drawing of G, that is, a representation where the vertices lie on horizontal layers and the edges are monotone polygonal chains connecting their end vertices.
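As an illustration of the layering phase only, here is a longest-path layer assignment, assuming any cycles have already been removed; the full methodology also reduces edge crossings within layers and assigns final coordinates.

```python
def longest_path_layering(nodes, edges):
    """Assign each vertex of an acyclic directed graph to the lowest
    layer consistent with all edges flowing downward (a common first
    phase of Sugiyama-style layered drawing)."""
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    layer = {}
    def assign(v):
        if v not in layer:
            layer[v] = 1 + max((assign(u) for u in preds[v]), default=-1)
        return layer[v]
    for v in nodes:
        assign(v)
    return layer

# A small diamond: a sits on layer 0, b and c on layer 1, d on layer 2.
print(longest_path_layering(["a", "b", "c", "d"],
                            [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]))
```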

The topology-shape-metrics approach is one of the most effective techniques for computing highly readable drawings of graphs. It represents graphs in the so-called orthogonal drawing standard: each vertex of the graph is drawn as a point or a box, and each edge is drawn as a polygonal chain of horizontal and vertical segments. Orthogonal drawings are widely used in many application fields, such as very large scale integration (VLSI), software engineering, information systems, and Internet and Web computing.

[Figure: an example of the topology-shape-metrics visualization approach.]

Blogosphere research issues

This post summarizes the research issues section of Nitin Agarwal and Huan Liu, “Blogosphere: Research Issues, Tools, and Applications”, ACM SIGKDD Explorations Newsletter, 10(1), 2008.

1- Modeling the Blogosphere

This issue concerns the models that best describe the structure and properties of the blogosphere, in order to gain deeper insight into the relationships between bloggers, commenters, blog posts, comments, viewers/readers, and different blog sites. Modeling the blogosphere is often associated with modeling the web. Researchers represent the web as a webgraph, where each webpage forms a node and the hyperlinks between pages form edges, which results in a directed cyclic graph. Weights can be associated with these edges. This graph model of the web has been extensively exploited; more about the web graph can be found in [1], and I might talk about it in a later post. So, why don't we model the blogosphere as a graph?
First, models developed for the web assume a dense graph structure, due to the large number of interconnecting hyperlinks between web pages. This assumption does not hold in the blogosphere, where the hyperlink structure is very sparse. Second, the level of interaction in terms of comments and replies to a blog post makes the blogosphere different from the web. Third, the highly dynamic and "short-lived" nature of blog posts cannot be simulated by web models: these models do not consider such dynamicity and assume that web pages accumulate links over time, whereas in a blog network where blog posts are the nodes it is impractical to construct a static graph like the one for the web. These differences have motivated researchers to come up with models specific to the characteristics of the blogosphere.
For example, Leskovec et al. [2] studied temporal patterns of the blogosphere: how often people create blog posts, how bursty and popular posts are, how the posts are linked, and what the link density is. They reported that these phenomena follow power-law distributions, and they used these findings to build a blog network.
Kumar et al. [3] use the blogrolls given on a blog post to create a network of connected posts, under the assumption that blogrolls link to related or similar blog posts.
A lot of research has been conducted that posits a known network structure of the blogosphere to model the problem domain; such models are specific to their problem domains.
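As a minimal sketch, the webgraph representation described above amounts to a weighted directed graph; the class and method names here are illustrative, not from the paper.

```python
from collections import defaultdict

class WebGraph:
    """Pages (or blog posts) are nodes; hyperlinks are directed,
    optionally weighted edges. Cycles are allowed."""

    def __init__(self):
        self.out_links = defaultdict(dict)  # node -> {neighbor: weight}

    def add_link(self, src, dst, weight=1.0):
        self.out_links[src][dst] = weight

    def out_degree(self, node):
        return len(self.out_links[node])

g = WebGraph()
g.add_link("post-1", "post-2")
g.add_link("post-2", "post-1", 0.5)  # a directed cycle
print(g.out_degree("post-1"))
```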

2- Blog Clustering

A lot of research is going on to automatically cluster blogs into meaningful groups, so that readers can focus on interesting categories rather than having to filter relevant blogs out of the jungle. Blog sites often allow their users to attach tags to blog posts; this human-labeled tag information forms the so-called "folksonomy".
Brooks and Montanez [4] presented a study showing that human-labeled tags are good for classifying blog posts into broad categories, but less effective in indicating the particular content of a post. They used the tf-idf measure to pick the top three words of every blog post, computed the pairwise similarity among all the blog posts, and clustered them. They compared the results with the clustering obtained from the human-labeled tags and reported significant improvement. It was also found that keyword-based clustering suffers from high dimensionality and sparsity. Agarwal et al. [5] proposed WisClus, which uses the collective wisdom of bloggers to cluster blogs. They use the blog categories and construct a category relation graph, whose nodes are categories and whose edges represent the similarity between categories, to merge different categories and cluster the blogs that belong to them. Their results showed that WisClus is better than keyword-based clustering.
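Here is a rough sketch of that keyword-based procedure, with a Jaccard overlap of the top-word sets standing in for whatever similarity measure the authors actually used:

```python
import math
from collections import Counter

def top_tfidf_words(posts, k=3):
    """For each tokenized post, pick the k words with the highest
    tf-idf weight."""
    df = Counter()
    for tokens in posts:
        df.update(set(tokens))
    n = len(posts)
    tops = []
    for tokens in posts:
        tf = Counter(tokens)
        scores = {w: tf[w] * math.log(n / df[w]) for w in tf}
        tops.append(set(sorted(scores, key=scores.get, reverse=True)[:k]))
    return tops

def pairwise_similarity(tops):
    """Similarity of two posts = Jaccard overlap of their top-word sets."""
    return {(i, j): len(tops[i] & tops[j]) / (len(tops[i] | tops[j]) or 1)
            for i in range(len(tops)) for j in range(i + 1, len(tops))}

tops = top_tfidf_words([["python", "code", "blog"],
                        ["python", "travel", "blog"]])
print(pairwise_similarity(tops))
```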
I am going to talk about WisClus in detail in a later post.

3- Blog Mining

Blogs are immensely valuable resources for tracking consumers' beliefs and opinions, gauging initial reactions to a launch, understanding consumer language, tracking trends and buzzwords, and fine-tuning information needs. Blog conversations leave behind trails of links that are useful for understanding how information flows and how opinions are shaped and influenced. Tracking blogs also helps in gaining deeper insights, as bloggers share their views from various perspectives, giving a 'context' to the information collected.
A prototype system called Pulse [19] uses a Naïve Bayes classifier trained on sentences manually annotated with positive/negative sentiment, and iterates until all unlabeled data is adequately classified. Another system presented in [5] improves blog retrieval by using opinionated words acquired from WordNet in the query proximity. More about WordNet and Naïve Bayes in later posts.
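Meanwhile, here is a minimal multinomial Naïve Bayes sentiment classifier of the kind Pulse starts from; the iterative retraining on unlabeled data is omitted.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSentiment:
    """Multinomial Naive Bayes over sentence tokens with Laplace
    smoothing (only the supervised step; Pulse additionally iterates
    over unlabeled sentences)."""

    def fit(self, sentences, labels):
        self.word_counts = defaultdict(Counter)  # label -> word counts
        self.label_counts = Counter(labels)
        self.vocab = set()
        for tokens, label in zip(sentences, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, tokens):
        total = sum(self.label_counts.values())
        best, best_lp = None, float("-inf")
        for label, n in self.label_counts.items():
            lp = math.log(n / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in tokens:
                lp += math.log((self.word_counts[label][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

clf = NaiveBayesSentiment()
clf.fit([["love", "it"], ["hate", "it"]], ["pos", "neg"])
print(clf.predict(["love", "this"]))  # -> "pos"
```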

4- Community Discovery and Factorization

One method that researchers commonly use is content and text analysis of blog posts to identify communities in the blogosphere. An alternative approach identifies communities on the web using hubs and authorities, clustering expert communities together by identifying them as authorities (more about hubs and authorities at http://www.urlanalysis.info/hubs.asp). Chin and Chignell [6] proposed a model for finding communities that takes the blogging behavior of bloggers into account; they aligned behavioral approaches to studying community with network and link analysis approaches. I am going to talk about this paper in detail later.
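For reference, here is a minimal sketch of the hubs-and-authorities (HITS) iteration: a page is a good authority if good hubs point to it, and a good hub if it points to good authorities.

```python
import math

def hits(out_links, iterations=50):
    """Iteratively refine hub and authority scores, with L2
    normalization after each update."""
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        auth = {v: sum(hub[u] for u in nodes if v in out_links.get(u, ()))
                for v in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {v: a / norm for v, a in auth.items()}
        hub = {u: sum(auth[v] for v in out_links.get(u, ())) for u in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {u: h / norm for u, h in hub.items()}
    return hub, auth

links = {"a": ["b", "c"], "b": ["c"], "c": []}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # "c" accumulates the most authority
```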

5- Filtering Spam Blogs (a.k.a. splogs)

Besides degrading the quality of search results, splogs also waste network resources. Some researchers consider splog detection a special case of web spam detection, but there are critical differences between the two. The content of blog sites is far more dynamic than that of web pages, so content-based spam filters are ineffective. Moreover, spammers can copy content from regular blog posts to evade content-based filters, and link-based spam filters can easily be beaten by creating links pointing to the splogs.


Two more research issues were presented: influence and propagation in blogs, and trust and reputation. They are not currently essential to our work.

References


[1] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins, and Eli Upfal. “The Web as a Graph”. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2000.



[2] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Cascading behavior in large blog graphs. In SIAM International Conference on Data Mining, 2007.



[3] Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins. On the Bursty Evolution of Blogspace. In Proceedings of the 12th International Conference on World Wide Web, pages 568–576, New York, NY, USA, 2003. ACM Press.



[4] Christopher H. Brooks and Nancy Montanez. Improved Annotation of the Blogosphere via Autotagging and Hierarchical Clustering. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 625–632, New York, NY, USA, 2006. ACM Press.



[5] Nitin Agarwal, Magdiel Galan, Huan Liu, and Shankar Subramanya. Clustering blogs with collective wisdom. In Proceedings of the International Conference on Web Engineering, 2008.



[6] Alvin Chin and Mark Chignell. A Social Hypertext Model for Finding Community in Blogs. In HYPERTEXT '06: Proceedings of the Seventeenth Conference on Hypertext and Hypermedia, pages 11–22, New York, NY, USA, 2006. ACM Press.