Research Article
Dynamic Web log session identification with statistical language models
Article first published online: 13 AUG 2004
DOI: 10.1002/asi.20084
Copyright © 2004 Wiley Periodicals, Inc.
Issue

Journal of the American Society for Information Science and Technology
Volume 55, Issue 14, pages 1290–1303, December 2004
Additional Information
How to Cite
Huang, X., Peng, F., An, A. and Schuurmans, D. (2004), Dynamic Web log session identification with statistical language models. Journal of the American Society for Information Science and Technology, 55: 1290–1303. doi: 10.1002/asi.20084
Publication History
- Issue published online: 10 NOV 2004
- Article first published online: 13 AUG 2004
- Manuscript Accepted: 23 JAN 2004
Abstract
We present a novel session identification method based on statistical language modeling. Unlike standard timeout methods, which use fixed time thresholds for session identification, we use an information theoretic approach that yields more robust results for identifying session boundaries. We evaluate our new approach by learning interesting association rules from the segmented session files. We then compare the performance of our approach to three standard session identification methods—the standard timeout method, the reference length method, and the maximal forward reference method—and find that our statistical language modeling approach generally yields superior results. However, as with every method, the performance of our technique varies with changing parameter settings. Therefore, we also analyze the influence of the two key factors in our language-modeling–based approach: the choice of smoothing technique and the language model order. We find that all standard smoothing techniques, save one, perform well, and that performance is robust to language model order.
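To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows a fixed-threshold timeout splitter next to a simplified language-model splitter. The abstract does not give the boundary criterion, so the second function rests on assumptions: it uses an add-one-smoothed bigram model and starts a new session whenever the surprisal of a page transition exceeds a threshold. All function names, the 1800-second default, and the threshold are hypothetical.

```python
import math
from collections import Counter


def timeout_sessions(requests, timeout=1800):
    """Baseline: split when the gap between consecutive requests exceeds
    `timeout` seconds. `requests` is a list of (timestamp, page) pairs."""
    sessions = []
    last_ts = None
    for ts, page in requests:
        if last_ts is None or ts - last_ts > timeout:
            sessions.append([])
        sessions[-1].append(page)
        last_ts = ts
    return sessions


def train_bigram(pages):
    """Count contexts and bigrams over a training page sequence."""
    contexts = Counter(pages[:-1])
    bigrams = Counter(zip(pages, pages[1:]))
    return contexts, bigrams, len(set(pages))


def surprisal(model, prev, cur):
    """-log2 P(cur | prev) with add-one smoothing (an assumed choice)."""
    contexts, bigrams, vocab_size = model
    p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
    return -math.log2(p)


def lm_sessions(pages, model, threshold):
    """Assumed criterion: start a new session at any transition whose
    surprisal under the model exceeds `threshold` bits."""
    if not pages:
        return []
    sessions = [[pages[0]]]
    for prev, cur in zip(pages, pages[1:]):
        if surprisal(model, prev, cur) > threshold:
            sessions.append([])
        sessions[-1].append(cur)
    return sessions
```

The timeout splitter depends only on timestamps, while the language-model splitter depends only on how predictable each request is given the previous one. Swapping the smoothing inside `surprisal` or widening the context beyond one page corresponds to the two factors the abstract analyzes: smoothing technique and model order.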
