Saturday, October 10, 2009

Applying Social Network Analysis on Blogospheres

This post summarizes SUN Wen-jun and QIU Hang-ming: "A Social Network Analysis on Blogospheres", in the International Conference on Management Science & Engineering (15th), Long Beach, USA, September, 2008.

Blogging has become a way for people to interact socially. You can follow somebody else's stories, comment on them, and refer to them, often without any distance restrictions between you and the author or any technical knowledge beyond basic use of the Internet. That is why the blogosphere is considered a manifestation of a social network and reflects a kind of social relationship between users. A social network is defined as a collection of nodes (social actors) and links between the nodes (the interconnections), and social network analysis (SNA) cares only about the relationships between the actors, without any concern for the actors themselves.

Graph theory is one of the foundations of SNA, and social networks are commonly formalized as social graphs and social relation matrices.
Common concepts of social network analysis:

1- Degree

The degree of a node is the number of links to this node. In a directed graph, there is an in-degree and an out-degree. The out-degree of node i is the number of links pointing out of this node, and the in-degree is the number of links pointing to this node. If both the in-degree and the out-degree are zero, then the node is called an isolated node.

2- Geodesic path
The shortest path between two nodes.

3- Geodesic distance
The length of the geodesic path between nodes i and j, denoted d(i, j).

4- Diameter
The length of the longest geodesic path: D = max {d(i, j)}.

5- Density
A measure of the closeness of a network. The more links among a given number of nodes, the larger the density of the network.


ρ = 2l / (n (n − 1)) for an undirected graph

ρ = l / (n (n − 1)) for a directed graph

where ρ is the density, l is the number of links, and n is the number of nodes.

6- Power and centrality
Social scientists measure power from the perspective of "relationships".
Centrality is used to express the concept of power: it tells how central a role a node plays in a social network.
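
As a quick illustration of the concepts above, here is a minimal sketch using the Python networkx library on a made-up blog network (the graph and the library choice are assumptions for illustration, not part of the paper):

# A minimal sketch of degree, geodesic distance, diameter, and density
# on a tiny, hypothetical directed blog network.
import networkx as nx

# An edge A -> B means blog A links to blog B.
G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"), ("C", "A"), ("D", "A")])

print(G.in_degree("A"), G.out_degree("A"))   # in-degree and out-degree of node A
print(nx.shortest_path_length(G, "D", "C"))  # geodesic distance d(D, C)

# Diameter: the longest geodesic distance; computed on the undirected view
# here so that it is defined even when the digraph is not strongly connected.
print(nx.diameter(G.to_undirected()))

print(nx.density(G))                         # l / (n * (n - 1)) for a directed graph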

Links categories
In a Blogosphere, the individual blogs are represented as nodes in the graph and the inter-links between them are directional links in the graph.
There are four categories of links:
1- The link put by the BSP (Blog Service Provider); these links are present on all blogs under the same blog site.
2- The citation of other blogs intentionally put in the page by the owner of the blog.
3- Links to other blogs in a special fixed location on a blog's page layout.
4- Links left by visitors of a blog in reviews and comments.

Monday, October 5, 2009

Algorithms for ranking blog posts

First, why are the original ranking algorithms for web pages not suitable for blog posts?

· Posts have unique features that differ from those of web pages, such as trackbacks, scraps, and comments.

· Posts are not rich in hyperlinks, so the normal ranking algorithms are not enough for good ranking.

A trackback is the act of writing a new post related to someone else's post and putting a link to the original post within the new post.

A scrap is the act of copying someone else's post to one's own blog.

A comment is the act of putting a short opinion on someone else's post.

Trackbacks and scraps are considered more important than comments.

So how can this problem be solved? By making some modifications to the original web-page ranking algorithms so that they suit blog posts.

Web-page ranking algorithms include InDegree, PageRank, and HITS.

First we have to define the concept of the web graph: each node of this graph is a web page, and there is an edge from one node to another if the first page has a hyperlink referencing the second page.

The first and simplest algorithm is InDegree, which measures the importance of a page by the number of pages referencing it.

PageRank, the algorithm used by Google, interprets a link from page A to page B as a vote by page A for page B. But Google looks at more than the sheer volume of votes, or links, a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help make other pages "important".

HITS (authorities and hubs): an authority value is computed as the sum of the scaled hub values of the pages that point to that page. A hub value is the sum of the scaled authority values of the pages it points to.
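
To make the update rules concrete, here is a minimal sketch in Python on a tiny, made-up web graph (the graph, iteration count, and normalization are illustrative assumptions, not the implementation used by any search engine):

# A minimal sketch of the HITS (authorities and hubs) update rules.
# links[p] is the set of pages that page p points to.
links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}, "D": {"C"}}
pages = links.keys()

auth = {p: 1.0 for p in pages}
hub = {p: 1.0 for p in pages}

for _ in range(50):  # fixed number of iterations for simplicity
    # authority of p = sum of hub scores of the pages pointing to p
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # hub of p = sum of authority scores of the pages p points to
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # scale both score vectors so the values do not blow up
    a_norm = sum(v * v for v in auth.values()) ** 0.5
    h_norm = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print(sorted(auth.items(), key=lambda kv: -kv[1]))  # pages ranked by authority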

The update for blog-post ranking is that, instead of the web graph, a bipartite graph consisting of two different node types, called the post-blogger graph, is used, as in the figure.




Algorithms for ranking posts

InDegree is applied without modifications

AuthHub:

AuthHub is a new version of HITS modified for post ranking in the blog environment. Note that in the original HITS, both an authority score and a hub score are given to a web page. In AuthHub, however, an authority score is given only to a post, and a hub score is given only to a blogger. Thus, the ranks of posts are decided only by authority scores. The authority score of a post is the sum of the hub scores of all the bloggers who do trackbacks or scraps on the post, and the hub score of a blogger is the sum of the authority scores of the posts that the blogger does trackbacks or scraps on.
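
Following the description above, a minimal sketch of the AuthHub iteration on a made-up post-blogger graph might look like this (the data, iteration count, and normalization are assumptions for illustration; the paper's exact formulation may differ):

# Authority scores are kept only for posts and hub scores only for bloggers,
# on a bipartite post-blogger graph.
# acts[b] = set of posts that blogger b trackbacks or scraps.
acts = {"blogger1": {"post1", "post2"},
        "blogger2": {"post2"},
        "blogger3": {"post2", "post3"}}
posts = {p for ps in acts.values() for p in ps}

auth = {p: 1.0 for p in posts}   # one authority score per post
hub = {b: 1.0 for b in acts}     # one hub score per blogger

for _ in range(50):
    # authority of a post = sum of hub scores of the bloggers acting on it
    auth = {p: sum(hub[b] for b in acts if p in acts[b]) for p in posts}
    # hub of a blogger = sum of authority scores of the posts he/she acts on
    hub = {b: sum(auth[p] for p in acts[b]) for b in acts}
    a_norm = sum(v * v for v in auth.values()) ** 0.5
    h_norm = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {b: v / h_norm for b, v in hub.items()}

print(sorted(auth.items(), key=lambda kv: -kv[1]))  # posts ranked by authority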

Conclusion of evaluation of modifications

After evaluation on a large dataset, taking users' opinions into account, AuthHub performed better than the original ranking algorithms and the other modifications to them.

Reference

[1] Post Ranking Algorithms in Blog Environment, Second International Conference on Future Generation Communication and Networking Symposia, 2008


Monday, September 28, 2009

Modeling bloggers' interests

Modeling bloggers' interests has become an important blog research issue, as it can help construct user profiles. Most researchers focus on only one feature for modeling bloggers' interests, such as a combined classifier to identify interests from single posts [1] or the use of social hypertext [2]. However, many features can be used to detect interests, such as textual, temporal and interactive features [3]. Other features, such as comments and bloggers' communities, could also be used to identify bloggers' interests, but they are not introduced in this post.

Two interest models are introduced: STIM (short-term interest model) and LTIM (long-term interest model). The first handles short-term interests with weak stability, and the second handles long-term interests with strong stability. Both models combine textual and temporal features.

Each model depends on a window of time, i.e., the posts published within a certain time period, which is used to update the original interest vector. A forgetting function can be used to calculate the coefficients that determine the time window [4].
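
As an illustration only, a sketch of such a time-windowed update with a simple exponential forgetting function is shown below; the exact forgetting function and coefficients used in [4] are not reproduced here:

# A minimal sketch of weighting posts by a forgetting function, assuming a
# simple exponential decay (an illustrative assumption, not the paper's formula).
import math

def forget_weight(days_ago, half_life=30.0):
    """Weight of a post written `days_ago` days ago."""
    return math.exp(-math.log(2) * days_ago / half_life)

def update_interest(interest, window_posts):
    # Each post in the window contributes its term weights,
    # discounted by how old the post is.
    for days_ago, term_weights in window_posts:
        w = forget_weight(days_ago)
        for term, tw in term_weights.items():
            interest[term] = interest.get(term, 0.0) + w * tw
    return interest

profile = update_interest({}, [(2, {"python": 0.8}), (60, {"travel": 0.9})])
print(profile)   # the recent "python" post outweighs the older "travel" post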

References:

[1] Xiaochuan Ni, et al. “Automatic Identification of Chinese
Weblogger Interests Based on Text Classification”. In WI’ 2006.

[2] Alvin Chin, et al. “A Social Hypertext Model for Finding
Community in Blogs”. In HT’06.

[3] Chun-Yuan Teng, et al. “Detection of Bloggers’ Interests:
Using Textual, Temporal, and Interactive Features”. In WI’2006

[4] Y. Cheng, G. Qiu, J. Bu, K. Liu, Y. Han, C. Wang and C. Chen. Model Bloggers’ Interests Based on Forgetting Mechanism. In WWW 2008.

How to use comments as a measure for understanding blog posts

Most proposed research on the blogosphere focuses only on blog posts and ignores their comments in the analysis. Comments can be used as a measure for understanding their corresponding posts.

A lot of applications can benefit from comments-oriented summarization, such as blog search, blog representation, reader feedback and others. Among blog posts that contain comments, an average of 6.3 comments per post was observed. Nevertheless, very few studies on blog comments and post summarization have been reported.

One of the described models extracts the representative sentences from a blog post that best represent the topics discussed in its comments. First, representative words are derived from the comments; then the sentences containing these representative words are selected. Several word-representativeness measures can be used, such as binary, comment frequency, term frequency, and the ReQuT model, where ReQuT covers three aspects: Reader, Quotation and Topic.

Three observations are used as guidelines for the ReQuT model.
• Observation 1: many readers mention other readers' names as a reply to them.
• Observation 2: a comment may contain quoted sentences.
• Observation 3: discussions in comments often branch into several topics, and sets of comments are linked together by sharing the same topic.

The simpler representativeness measures mentioned earlier can be defined as follows.
• Binary: whether a word appears in at least one comment.
• Comment frequency: the number of comments containing the word (similar to document frequency).
• Term frequency: the number of occurrences of the word in all comments of the blog post.
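
A small computational sketch of these three measures on made-up comments (the data is invented for illustration; the ReQuT measures are not shown):

# Binary, comment frequency, and term frequency for the words appearing in
# the comments of one hypothetical blog post.
comments = ["great post about graph mining",
            "graph mining is hard",
            "thanks for sharing"]

tokenized = [c.split() for c in comments]
vocabulary = {w for words in tokenized for w in words}

binary = {w: 1 for w in vocabulary}                         # appears in at least one comment
comment_freq = {w: sum(w in words for words in tokenized)   # number of comments containing the word
                for w in vocabulary}
term_freq = {w: sum(words.count(w) for words in tokenized)  # total occurrences in all comments
             for w in vocabulary}

print(comment_freq["graph"], term_freq["graph"])   # 2 and 2 for this toy example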

However, all three of these measures can suffer from spam comments. Other measures can be used, such as the authors of comments and quotations among comments.

An equation based on the previous measures, with corresponding weights, is derived to calculate the word-representativeness score. Two sentence-selection methods are then used with these measures and evaluated against human-labeled sentences.

References:

M. Hu, A. Sun and E. Lim. Comments-Oriented Blog Summarization by Sentence Extraction. CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, 2007.

Sunday, September 27, 2009

Graph Visualization

This post summarizes chapter 3, "Graph Visualization and Data Mining", from Diane J. Cook and Lawrence B. Holder, "Mining Graph Data".

A graph drawing algorithm receives as input a combinatorial description of a graph and returns as output a drawing of the graph. In particular, the edges are represented as polygonal chains in a polyline drawing, as chains of alternating horizontal and vertical segments in an orthogonal drawing, and as segments in a straight-line drawing. One of the fundamental properties of a drawing is its readability, that is, its capability of conveying the information associated with the graph in a clear way. The benefits of graph drawing techniques for data mining purposes include analysis through visual exploration, discovery of patterns and correlations, and abstraction and summarization.

Social network analysts use graph-theoretic methodologies to identify important actors, crucial links, subgroups, roles, and network characteristics in such graphs. In this context, graph visualization is essential to make complicated types of analysis and data handling, such as identification of recurrent patterns and of relationships, more transparent, intuitive, and readily accessible. For example, the central role of an element in a collaboration network can be more easily detected by visual inspection of a radial drawing of the network, that is, a drawing where the vertices lie on concentric circles.

Web searching, i.e., developing effective technologies for mining data from the World Wide Web, is a very fertile research field. The effectiveness and efficiency of most common search engines rely on the structural properties of the Web graph (Web resources and their hyperlinks). Valuable insight into the link structure and the page rank of the Web graph can be provided by automatic visualization systems. Also, a new generation of Web search engines is being designed, where the data collected from the World Wide Web are clustered into meaningful semantic categories; the user interacts with the Web data by means of a graphical user interface that displays these categories organized as a tree or as a graph.

Three of the most popular graph drawing approaches are the force-directed, the hierarchical, and the topology-shape-metrics approaches:

Force-directed methods are quite popular for network analysis since they are easy to implement and often produce aesthetically pleasing results. The basic idea is to model a graph G as a mechanical system where the vertices of G are steel rings and the edges of G are springs that connect pairs of rings. To compute a drawing of G, the rings are placed in some initial layout and let go, until the spring forces move the system to a local minimum of energy.
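
As a rough illustration of the spring model, here is a minimal sketch of a force-directed iteration in Python (the force formulas, constants, and step size are simplifying assumptions, not a specific published algorithm):

# A minimal spring-embedder sketch: springs pull adjacent vertices together,
# all vertex pairs repel each other, and positions are nudged until the
# system settles near a local minimum of energy.
import math
import random

edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
nodes = {v for e in edges for v in e}
pos = {v: [random.random(), random.random()] for v in nodes}

K = 0.3   # ideal spring length
for _ in range(200):
    disp = {v: [0.0, 0.0] for v in nodes}
    # repulsive force between every pair of vertices
    for u in nodes:
        for v in nodes:
            if u == v:
                continue
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-6
            f = K * K / d
            disp[u][0] += f * dx / d
            disp[u][1] += f * dy / d
    # attractive (spring) force along every edge
    for u, v in edges:
        dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
        d = math.hypot(dx, dy) or 1e-6
        f = d * d / K
        disp[u][0] -= f * dx / d
        disp[u][1] -= f * dy / d
        disp[v][0] += f * dx / d
        disp[v][1] += f * dy / d
    # move each vertex a small step in the direction of the net force
    for v in nodes:
        pos[v][0] += 0.01 * disp[v][0]
        pos[v][1] += 0.01 * disp[v][1]

print(pos)   # final coordinates of the layout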

The hierarchical approach targets hierarchies, which arise in a variety of applications, for example, Project Evaluation Review Technique (PERT) diagrams in project planning, class hierarchies in software engineering, and is-a relationships in knowledge representation systems. It is customary to represent hierarchical graphs so that all the edges flow in the same direction, for example, from top to bottom or from left to right. Algorithms that compute drawings of hierarchies generally follow a methodology first introduced by Sugiyama et al. It accepts as input a directed graph G without any particular restriction (G can be planar or not, acyclic or cyclic) and produces as output a layered drawing of G, that is, a representation where the vertices lie on horizontal layers and the edges are monotone polygonal chains connecting their end vertices.

The topology-shape-metrics approach is one of the most effective techniques for computing highly readable drawings of graphs. It represents graphs in the so-called orthogonal drawing standard, that is, each vertex of the graph is drawn as a point or a box and each edge is drawn as a polygonal chain of horizontal and vertical segments. Orthogonal drawings are widely used in many application fields, such as very large scale integration (VLSI), software engineering, information systems, the Internet, and Web computing. (Figure: an example of the topology-shape-metrics visualization approach.)

Blogosphere research issues

This post summarizes the research issues section of Nitin Agarwal and Huan Liu: “Blogosphere: Research Issues, Tools, and Applications”, in the Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.

1-Modeling the Blogosphere

This issue concerns the models that best describe the structure and properties of the blogosphere, in order to gain deeper insights into the relationships between bloggers, commenters, blog posts, comments, viewers/readers, and different blog sites in the blogosphere. Modeling the blogosphere is often associated with modeling the web. Researchers represent the web as a web graph, where each web page forms a node and the hyperlinks between them form edges. This representation results in a directed cyclic graph, and weights can be associated with the edges. Such a model that converts the web into a graph is extensively exploited. More about the web graph can be found in [1], and I might talk about it in a later post. So, why don’t we model the blogosphere as a graph?
First, models developed for the web assume a dense graph structure due to the large number of interconnecting hyperlinks within web pages. This assumption does not hold true in the blogosphere, since the hyperlink structure in the blogosphere is very sparse. Second, the level of interaction in terms of comments and replies to a blog post makes the blogosphere different from the web. Third, the highly dynamic and "short-lived" nature of blog posts cannot be simulated by the web models. Web models do not consider this dynamicity in the web pages; they assume web pages accumulate links over time. However, in a blog network where blog posts are the nodes, it is impractical to construct a static graph like the one for the web. These differences necessitate a model tailored to the characteristics of the blogosphere.
This has motivated researchers to come up with models specific to the blogosphere.
For example, Leskovec et al. [2] studied the temporal patterns of the blogosphere, such as how often people create blog posts, their burstiness and popularity, how the posts are linked, and what the link density is. They reported that these phenomena follow power-law distributions and they managed to create a blog network.
Kumar et al. [3] use the blogrolls given on a blog to create a network of connected posts, with the underlying assumption that blogrolls have links to related or similar blog posts.
A lot of research has been conducted that posits a known network structure of the blogosphere to model the problem domain. Such models are specific to problem domains.

2- Blog Clustering

A lot of research is going on to automatically cluster different blogs into meaningful groups so that readers can focus on interesting categories rather than having to filter relevant blogs out of the jungle. Blog sites often allow their users to attach tags to blog posts. This human-labeled tag information forms the so-called "folksonomy".
Brooks and Montanez [4] presented a study in which the human-labeled tags were good for classifying blog posts into broad categories but less effective at indicating the particular content of a blog post. They used the tf-idf measure to pick the top three words of every blog post, computed the pairwise similarity among all the blog posts, and clustered them. They compared the results with the clustering obtained using the human-labeled tags and reported a significant improvement. It was also found that keyword-based clustering suffers from high dimensionality and sparsity. Agarwal et al. [5] proposed WisClus, which uses the collective wisdom of the bloggers to cluster the blogs. They used the blog categories and constructed a category relation graph, merging different categories and clustering the blogs that belong to them. Edges in the category relation graph represent the similarity between the different categories, which are the nodes of the graph. Their results showed that WisClus is better than keyword-based clustering.
I am going to talk about WisClus in detail in a later post.
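
For illustration, a minimal sketch of keyword-based clustering with tf-idf and pairwise cosine similarity might look like the following (the posts are invented, and scikit-learn with KMeans is an assumed toolchain; the exact settings of the Brooks and Montanez study, such as keeping only the top three words per post, are not reproduced):

# Cluster blog posts by their tf-idf keyword vectors.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = ["python graph mining and social networks",
         "social network analysis of the blogosphere",
         "my trip to the mountains last summer",
         "hiking and camping in the mountains"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(posts)
print(cosine_similarity(tfidf).round(2))   # pairwise post similarity matrix

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
print(labels)   # e.g. [0, 0, 1, 1]: blog-related posts vs. travel-related posts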

3- Blog Mining

Blogs are immensely valuable resources for tracking consumers' beliefs and opinions, tracking initial reactions to a launch, understanding consumer language, tracking trends and buzzwords, and fine-tuning information needs. Blog conversations leave behind trails of links that are useful for understanding how information flows and how opinions are shaped and influenced. Tracking blogs also helps in gaining deeper insights, as bloggers share their views from various perspectives, giving a 'context' to the information collected.
A prototype system called Pulse [19] uses a Naïve Bayes classifier trained on manually annotated sentences with positive/negative sentiments and iterates until all unlabeled data is adequately classified. Another system presented in [5] improves blog retrieval by using opinionated words acquired from WordNet in the query proximity. More about WordNet and Naïve Bayes in later posts.
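
As a small illustration of the Naïve Bayes idea (not the actual Pulse system, which also iterates over unlabeled data), a sketch with scikit-learn on invented training sentences could look like this:

# Train a Naive Bayes sentiment classifier on a few hand-labeled sentences.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

sentences = ["i love this phone", "great battery life",
             "terrible screen", "the keyboard is awful"]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(sentences, labels)

print(model.predict(["the battery is great", "awful phone"]))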

4-Community Discovery and Factorization

One method that researchers commonly use is content and text analysis of the blog posts to identify communities in the blogosphere. An alternative approach identifies communities on the web using a hub-and-authority-based approach, clustering the expert communities together by identifying them as authorities. More about hubs and authorities: http://www.urlanalysis.info/hubs.asp. Chin and Chignell [6] proposed a model for finding communities that takes the blogging behavior of bloggers into account, aligning behavioral approaches to studying community with network and link analysis approaches. I am going to talk about this paper in detail later.

5- Filtering Spam Blogs (a.k.a. splogs)

Besides degrading the quality of search results, splogs also waste network resources. Some researchers consider splog detection a case of web spam detection, but there are some critical differences between the two. The content on blog sites is very dynamic compared to that of web pages, so content-based spam filters are ineffective. Moreover, spammers can copy content from regular blog posts to evade content-based spam filters, and link-based spam filters can easily be beaten by creating links pointing to the splogs.


Two more research issues were presented, influence and propagation in blogs, and trust and reputation, which are not currently essential to our work.

References


[1] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins, Eli Upfal, “The web as a graph”, in the Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2000.



[2] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Cascading behavior in large blog graphs. In SIAM International Conference on Data Mining, 2007.



[3] Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins. On the Bursty Evolution of Blogspace. In Proceedings of the 12th International Conference on World Wide Web, pages 568–576, New York, NY, USA, 2003. ACM Press.



[4] Christopher H. Brooks and Nancy Montanez. Improved annotation of the blogosphere via autotagging and hierarchical clustering. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 625–632, New York, NY, USA, 2006. ACM Press.



[5] Nitin Agarwal, Magdiel Galan, Huan Liu, and Shankar Subramanya. Clustering blogs with collective wisdom. In Proceedings of the International Conference on Web Engineering, 2008.



[6] Alvin Chin and Mark Chignell. A social hypertext model for finding community in blogs. In HYPERTEXT '06: Proceedings of the Seventeenth Conference on Hypertext and Hypermedia, pages 11–22, New York, NY, USA, 2006. ACM Press.