Abstract
Short-text messages such as tweets are being created and shared at an unprecedented rate. Tweets in their raw form, while informative, can also be overwhelming. For both end-users and data analysts, it is a nightmare to plow through millions of tweets that contain an enormous amount of noise and redundancy. In this paper, we propose a novel continuous summarization framework called Sumblr to alleviate the problem. In contrast to traditional document summarization methods, which focus on static and small-scale data sets, Sumblr is designed to deal with dynamic, fast-arriving, and large-scale tweet streams. Our proposed framework consists of three major components. First, we propose an online tweet stream clustering algorithm to cluster tweets and maintain distilled statistics in a data structure called the tweet cluster vector (TCV). Second, we develop a TCV-Rank summarization technique for generating online summaries and historical summaries of arbitrary time durations. Third, we design an effective topic evolution detection method, which monitors summary-based and volume-based variations to produce timelines automatically from tweet streams. Our experiments on large-scale real tweets demonstrate the efficiency and effectiveness of our framework.