Workshop Program

Saturday, November 26th, 2011
15:00-15:15 Opening (Chair: Satoshi Oyama, Hokkaido University, Organization Co-Chair)
Welcome Address
Young-Kuk Kim, Chungnam National University, General Co-Chair
Kazutoshi Sumiya, University of Hyogo, General Co-Chair
15:15-15:45 Keynote Speech 1 (Chair: Yutaka Kidawara, NICT, Program Co-Chair)
DB-IR Integration and Its Application to a Massively-Parallel Search Engine
Kyu-Young Whang, KAIST, Honorary Co-Chair
15:45-16:45 Research Session 1 (Chair: Kyuseok Shim, Seoul National University)
 15:45-16:15 SVM Indexing and Processing for Data Retrieval
Hwanjo Yu, POSTECH
 16:15-16:45 GPU-Accelerated Complex Data Mining
Toshiyuki Amagasa, University of Tsukuba
 16:45-17:00 Break
17:00-18:30 Research Session 2 (Chair: Miyuki Nakano, The University of Tokyo)
 17:00-17:30 iGraph: A Framework for Comparisons of Disk-based Graph Indexing Techniques
Wook-Shin Han, Kyungpook National University
 17:30-18:00 Fast and Effective Mining of Web-click Sequences
Yasushi Sakurai, NTT Communication Science Labs
 18:00-18:30 Managing Entities on the Web
Seung-won Hwang, POSTECH

Sunday, November 27th, 2011
9:00-9:30 Keynote Speech 2 (Chair: Yang-Sae Moon, Kangwon National University, Program Co-Chair)
Info-plosion Analytics Be New Enabler = Information Diffusion Studies on the 3.11 Disaster =
Masaru Kitsuregawa, The University of Tokyo, Honorary Co-Chair
9:30-10:30 Project and Future Vision Session 1 (Chair: Haruo Yokota, Tokyo Institute of Technology)
  9:30-10:00 Design and Implementation of WikiBOK: A BOK Formulation-aid System based on a Collective Intelligence Approach
Yoshifumi Masunaga, Aoyama Gakuin University
 10:00-10:30 Intelligent Computing for Smart Reality
Sang-goo Lee, Seoul National University
 10:30-10:45 Break
10:45-11:45 Project and Future Vision Session 2 (Chair: Jinho Kim, Kangwon National University)
 10:45-11:15 Search-Intent Types and Their Detection for Web Search
Katsumi Tanaka, Kyoto University, Honorary Co-Chair
 11:15-11:45 Flash SSD as Database Machine
Sang-Won Lee, Sungkyunkwan University
11:45-13:00 Lunch
13:00-15:00 Project and Future Vision Session 3 (Chair: Yoshiharu Ishikawa, Nagoya University)
 13:00-13:30 Towards Computational History through Large Scale Text Mining: Studies in Collective Memory
Adam Jatowt, Kyoto University
 13:30-14:00 SNQLSpatial+ (Sensor network query language)
Soon Joo Hyun, KAIST
 14:00-14:30 Information Services Platform: Shifting Cloud from Collective Intelligence to Connective Intelligence
Koji Zettsu, NICT
 14:30-15:00 Insight Search: a Challenge and Opportunity for Search in Web 2.0
Jaewoo Kang, Korea University
 15:00-15:10 Break
15:10-16:00 Discussion

Monday, November 28th, 2011
10:00-12:00 Steering Committee Meeting
13:00-16:00 Group Meeting
16:00-17:00 Closing Session

Saturday, November 26th, 2011
DB-IR Integration and Its Application to a Massively-Parallel Search Engine
Kyu-Young Whang, KAIST, Honorary Co-Chair
Nowadays, as there is an increasing need to integrate the DBMS (for structured data) with Information Retrieval (IR) features (for unstructured data), DB-IR integration is becoming one of the major challenges in the database area. Extensible architectures provided by commercial object-relational DBMS (ORDBMS) vendors can be used for DB-IR integration. Here, extensions are implemented using a high-level (typically, SQL-level) interface. We call this architecture loose-coupling. The advantage of loose-coupling is ease of implementation, but it is not preferable for implementing new data types and operations in large databases when high performance is required. In this talk, we present a new DBMS architecture applicable to DB-IR integration, which we call tight-coupling. In tight-coupling, new data types and operations are integrated into the core of the DBMS engine through its extensible type layer. Thus, they are incorporated as “first-class citizens” within the DBMS architecture and are supported in a consistent manner with high performance. This tight-coupling architecture is being used to incorporate IR features and spatial database features into the Odysseus ORDBMS, which has been under development at KAIST/AITrc for over 19 years. In this talk, we introduce Odysseus and explain its tightly-coupled IR features (U.S. patented in 2002). Then, we demonstrate the performance advantage of tight-coupling by showing benchmark results. Using Odysseus, we have built a web search engine capable of managing 100 million web pages per node in a non-parallel configuration. This engine has been successfully tested in many commercial environments, and the work won the Best Demonstration Award at the IEEE ICDE conference held in Tokyo, Japan, in April 2005. Last, we present the design of a massively-parallel search engine using Odysseus. Recently, parallel search engines have been implemented based on scalable distributed file systems (e.g., GFS). Nevertheless, building a massively-parallel search engine using a DBMS can be an attractive alternative, since it supports a higher-level (i.e., SQL-level) interface than that of a distributed file system while still providing scalability. The parallel search engine we have designed is capable of indexing 30 billion web pages with performance comparable to or better than that of state-of-the-art search engines.
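To make the tight-coupling idea concrete, here is a toy sketch (my own illustration, not Odysseus's API; the class and method names are hypothetical). The point is that the inverted index lives inside the storage engine and is maintained on every insert, so an IR predicate can be evaluated alongside structured predicates as a first-class operator rather than through an SQL-level extension layer:

```python
from collections import defaultdict

class TightlyCoupledTable:
    """Toy engine where the IR index is part of the storage layer."""

    def __init__(self):
        self.rows = []                      # structured row storage
        self.inverted = defaultdict(set)    # postings: term -> row ids

    def insert(self, url, year, text):
        rid = len(self.rows)
        self.rows.append({"url": url, "year": year, "text": text})
        for term in text.lower().split():   # index maintained in-engine
            self.inverted[term].add(rid)
        return rid

    def select(self, match_terms, year_min=None):
        # Analogous to: SELECT url FROM pages
        #               WHERE MATCH(text, ...) AND year >= ...
        hits = set.intersection(*(self.inverted[t.lower()] for t in match_terms))
        return [self.rows[r]["url"] for r in sorted(hits)
                if year_min is None or self.rows[r]["year"] >= year_min]

t = TightlyCoupledTable()
t.insert("a.com", 2004, "parallel search engine")
t.insert("b.com", 2005, "database search engine")
print(t.select(["search", "engine"], year_min=2005))  # ['b.com']
```

In a loose-coupling design, the keyword match would instead be delegated to an external extension through the SQL interface, which is what the abstract argues limits performance at scale.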
SVM Indexing and Processing for Data Retrieval
Hwanjo Yu, POSTECH
SVM (Support Vector Machine) is a well-established machine learning methodology popularly used for classification, regression, and ranking. Recently, SVM has been actively researched for rank learning and applied to various applications including search engines and relevance feedback systems. A query in such systems is the ranking function F learned by SVM. Once a function F is learned (i.e., the query is formulated), processing the query to find the top-k results requires evaluating the entire database by F. So far, there exists no exact indexing solution for SVM functions. Existing top-k query processing algorithms are not applicable to machine-learned ranking functions, as they often make restrictive assumptions on the query, such as linearity or monotonicity of functions. Existing metric-based or reference-based indexing methods are also not applicable, because data points are invisible in the kernel space (the SVM feature space) on which the index must be built. Existing kernel indexing methods return approximate results or fix kernel parameters. This work proposes an exact indexing solution for SVM functions with varying kernel parameters. We first propose key geometric properties of the kernel space -- ranking instability and ordering stability -- which are crucial for building indices in the kernel space. Based on them, we develop an index structure, iKernel, and processing algorithms. We then present clustering techniques in the kernel space to enhance the pruning effectiveness of the index. According to our experiments, iKernel is highly effective overall, evaluating only 1~5% of the data on large data sets. To the best of our knowledge, iKernel is the first indexing solution that finds exact top-k results of SVM functions without a full scan of the data set.
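The cluster-based pruning idea can be sketched for the simple linear case (the talk's actual method works with geometric bounds in the kernel space; this is only an illustrative analogue). Each cluster gets an upper bound on the score any member can achieve, via Cauchy-Schwarz: f(x) = w·c + w·(x-c) + b <= w·c + ||w||·R + b for a cluster with center c and radius R. Clusters whose bound cannot beat the current k-th best score are skipped, yet the result is exact:

```python
import heapq, math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def topk_pruned(points, clusters, w, b, k):
    """Exact top-k of f(x) = w.x + b, pruning whole clusters by an
    upper bound.  clusters: list of (center, member_indices)."""
    norm_w = math.sqrt(dot(w, w))
    bounds = []
    for center, members in clusters:
        R = max(dist(points[i], center) for i in members)
        ub = dot(w, center) + norm_w * R + b   # Cauchy-Schwarz bound
        bounds.append((ub, members))
    bounds.sort(reverse=True)                  # most promising first
    heap, evaluated = [], 0                    # min-heap of top-k scores
    for ub, members in bounds:
        if len(heap) == k and ub <= heap[0][0]:
            break                              # no member can enter top-k
        for i in members:
            evaluated += 1
            s = dot(w, points[i]) + b
            if len(heap) < k:
                heapq.heappush(heap, (s, i))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, i))
    return sorted(heap, reverse=True), evaluated
```

With two well-separated clusters, the far cluster is never evaluated, which is the "evaluate only a few percent of the data" effect the abstract reports for the kernel-space version.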
GPU-Accelerated Complex Data Mining
Toshiyuki Amagasa, University of Tsukuba
GPGPU (General Purpose computing on GPU) has recently become an interesting research subject in the field of high performance computing and many other fields. GPGPU means using the GPU (Graphics Processing Unit), originally designed for processing graphics, for general-purpose computation. In this talk, we present our recent work on GPU-accelerated data mining over complex data. First, we present a scheme to accelerate Probabilistic Latent Semantic Indexing (PLSI), an automated document indexing method based on a statistical latent semantic model, by exploiting the high parallelism of the GPU. Next, we introduce a method for fast frequent itemset mining from uncertain databases using the GPU. The idea is to accelerate probability computations by making the best use of the GPU.
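The probability computation at the heart of uncertain frequent itemset mining is the expected support: the sum over transactions of the product of the member items' existence probabilities. A minimal serial sketch is below (the GPU method evaluates these per-transaction products for many itemsets in parallel; this code only shows the quantity being computed, not the talk's algorithm):

```python
from itertools import combinations

def expected_support(db, itemset):
    """db: list of transactions, each {item: existence probability}.
    Expected support = sum over transactions of prod of item probs."""
    total = 0.0
    for t in db:
        p = 1.0
        for item in itemset:
            p *= t.get(item, 0.0)   # absent item -> probability 0
        total += p
    return total

def frequent_itemsets(db, items, minsup, max_size=2):
    """Enumerate small itemsets whose expected support >= minsup."""
    out = {}
    for size in range(1, max_size + 1):
        for X in combinations(items, size):
            s = expected_support(db, X)
            if s >= minsup:
                out[X] = s
    return out
```

Each transaction's product is independent of the others, which is exactly what makes the computation map well onto the GPU's data-parallel execution model.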
iGraph: A Framework for Comparisons of Disk-based Graph Indexing Techniques
Wook-Shin Han, Kyungpook National University
Graphs are of growing importance in modeling complex structures such as chemical compounds, proteins, images, and program dependence. Given a query graph Q, the subgraph isomorphism problem is to find the set of graphs containing Q from a graph database, which is NP-complete. Recently, there have been many research efforts to solve the subgraph isomorphism problem for a large graph database by utilizing graph indexes. By using a graph index as a filter, we prune graphs that are not real answers at an inexpensive cost; expensive subgraph isomorphism tests are then needed only to verify the filtered candidates. This way, the number of disk I/Os and subgraph isomorphism tests can be significantly reduced. The current practice for experiments in graph indexing techniques is that the authors of a newly proposed technique do not implement existing indexes in their own code base, but instead use the original authors' binary executables and report only the wall clock time. However, we observe that this practice may result in several problems. In order to address these problems, we have made significant efforts to implement all representative indexing methods on a common framework called iGraph. Unlike existing implementations, which either use (full or partial) in-memory representations or rely on the OS file system cache without guaranteeing real disk I/Os, we have implemented these indexes on top of a storage engine that guarantees real disk I/Os. Through extensive experiments using many synthetic and real datasets, we also provide new empirical findings on the performance of the full disk-based implementations of these methods.
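The filter-then-verify pipeline the abstract describes can be sketched in miniature. Here the "index" is a trivial label-count filter (real graph indexes use richer features such as paths or subtrees), and verification is a small backtracking subgraph isomorphism test; the graph encoding is my own for illustration:

```python
def label_counts(g):
    counts = {}
    for lbl in g["labels"].values():
        counts[lbl] = counts.get(lbl, 0) + 1
    return counts

def filter_candidates(query, database):
    """Cheap filter: g can contain the query only if it has at least as
    many nodes of each label (a stand-in for a real graph index)."""
    qc = label_counts(query)
    return [i for i, g in enumerate(database)
            if all(label_counts(g).get(l, 0) >= n for l, n in qc.items())]

def subgraph_iso(query, g):
    """Expensive verification: backtracking search for an embedding."""
    qn = list(query["labels"])
    def extend(mapping):
        if len(mapping) == len(qn):
            return True
        u = qn[len(mapping)]
        for v in g["labels"]:
            if v in mapping.values() or g["labels"][v] != query["labels"][u]:
                continue
            # every already-mapped query neighbor of u must stay adjacent
            if all((mapping[u2], v) in g["edges"] or (v, mapping[u2]) in g["edges"]
                   for u2 in mapping
                   if (u2, u) in query["edges"] or (u, u2) in query["edges"]):
                mapping[u] = v
                if extend(mapping):
                    return True
                del mapping[u]
        return False
    return extend({})

def search(query, database):
    # Filter prunes cheaply; the NP-hard test runs on survivors only.
    return [i for i in filter_candidates(query, database)
            if subgraph_iso(query, database[i])]
```

In iGraph the same pipeline runs over a disk-based storage engine, so the filter's quality directly determines how many real disk I/Os and isomorphism tests are spent on false candidates.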
Fast and Effective Mining of Web-click Sequences
Yasushi Sakurai, NTT Communication Science Labs
Given a large stream of users clicking on web sites, how can we find trends, patterns, and anomalies? We have developed a novel method, WindMine, and its fine-tuning sibling, WindMine-part, to find patterns and anomalies in such datasets. Our approach has the following advantages: (a) it is effective in discovering meaningful "building blocks" and patterns, such as the lunch-break trend, as well as anomalies; (b) it automatically determines suitable window sizes; and (c) it is fast, with wall clock time linear in the duration of the sequences. Moreover, it can be made sub-quadratic in the number of sequences with little loss of accuracy. We examine its effectiveness and scalability through experiments on real data (one billion clicks over 30 days). Our proposed WindMine produces concise, informative, and interesting patterns. We also show that WindMine-part can be easily implemented in a parallel or distributed setting and that, even in a single-machine setting, it can be an order of magnitude faster (up to 70 times) than the plain version.
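As a much simplified stand-in for window-based pattern and anomaly discovery (WindMine itself chooses window sizes automatically and extracts "building blocks"; the two helpers below are only illustrative), one can smooth a click sequence with a sliding window and flag points that deviate sharply from the trailing-window mean:

```python
def moving_average(xs, w):
    """Smoothed view of a click sequence with window size w."""
    return [sum(xs[i:i + w]) / w for i in range(len(xs) - w + 1)]

def anomalies(xs, w, thresh):
    """Indices whose value is more than `thresh` away from the mean
    of the preceding w points -- a crude spike detector."""
    out = []
    for i in range(w, len(xs)):
        mean = sum(xs[i - w:i]) / w
        if abs(xs[i] - mean) > thresh:
            out.append(i)
    return out
```

Both passes are a single linear scan per window size, which is consistent with the abstract's claim of wall clock time linear in the sequence duration.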
Managing Entities on the Web
Seung-won Hwang, POSTECH
This talk introduces the problem of managing large-scale entities extracted in the form of graphs or tables. While existing approaches focus on using surface names, we discuss how relationships can be leveraged and how scalability can be achieved. We present evaluation results validating these approaches.

Sunday, November 27th, 2011
Info-plosion Analytics Be New Enabler = Information Diffusion Studies on the 3.11 Disaster =
Masaru Kitsuregawa, The University of Tokyo, Honorary Co-Chair
Recently, a new buzzword, Big Data, has become popular. I suppose the emotion behind it is the same as that behind Info-plosion. Several analytics examples will be shown, including information diffusion studies on the 3.11 disaster.
Design and Implementation of WikiBOK: A BOK Formulation-aid System based on a Collective Intelligence Approach
Yoshifumi Masunaga, Aoyama Gakuin University
The design and implementation of WikiBOK, a Wiki-based body of knowledge (BOK) formulation-aid system, is investigated. In contrast to formulating a BOK for a matured discipline such as computer science, BOK formulation for a new discipline such as social informatics needs a “bottom-up” approach because academics in a new discipline cannot draw its entire figure par avance. Therefore, an open collaboration approach based on the collective intelligence concept seems promising. WikiBOK is under development as part of our project based on BOK+, which is a novel BOK formulation principle for new disciplines. It uses Semantic MediaWiki (SMW) to facilitate its fundamental functions. To support a rich graphical user interface for WikiBOKers, a graph visualization software, Graphviz, is adopted. SMW is enhanced to work in conjunction with Graphviz. Because edit conflicts occur when WikiBOKers collaborate, a resolution principle is investigated to resolve BOK tree edit conflicts.
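The BOK-tree edit-conflict problem can be illustrated minimally. The rule below (two collaborators reparenting the same node to different places) is my own simplification for illustration, not WikiBOK's actual resolution principle:

```python
def conflicting_edits(edit_a, edit_b):
    """Each edit maps a BOK-tree node to its proposed new parent.
    A structural conflict arises when both collaborators move the
    same node but to different parents."""
    return {n for n in edit_a if n in edit_b and edit_a[n] != edit_b[n]}
```

Detecting such node-level clashes is the easy part; the abstract's point is that a principled policy is still needed to decide which tree structure survives.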
Intelligent Computing for Smart Reality
Sang-goo Lee, Seoul National University
Smart reality is a world where reality and cyberspace are tightly integrated by billions of sensing and control devices that are capable of autonomous computation and communication, thereby empowering reality with computational intelligence. We are at the transition into such a world. In a smart reality, real-world entities, such as people, devices, buildings, and the environment, will consistently interact and cooperate with each other in dynamic ways. This calls for an effective way of collaboration between devices that is adaptive and simple. The ability to select the right functions/services, as well as collaboration partners, depending on the current context will be important. We present some of our research proposals for achieving this computing platform with a data-oriented approach.
Search-Intent Types and Their Detection for Web Search
Katsumi Tanaka, Kyoto University, Honorary Co-Chair
Search intents and their detection are a hot research area in Web search, spanning (1) search query classification (informational, navigational, and transactional queries), (2) search result diversification, and (3) query recommendation. In this talk, after a brief survey of research on search-intent detection, we introduce a new framework for search intents: "search-intent types" and their detection. Our proposed search-intent types include, for example, "typicality", "diversity", "comprehensibility", and "concreteness", as well as the well-known "query-document similarity (relevance)". We show query processing methods for these new types of search intents. Next, we suggest methods to detect these search-intent types for user search queries. The proposed methods are based on (1) the use of knowledge extracted from community QA corpora and (2) relevance feedback based on search-intent types.
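To make the notion of a "search-intent type" concrete, here is a toy keyword-rule detector. The rules are my own illustrative guesses, not the talk's method, which learns from community-QA corpora and relevance feedback:

```python
def intent_types(query):
    """Assign illustrative search-intent types to a query string."""
    q = query.lower()
    types = set()
    if any(w in q for w in ("typical", "standard", "usual")):
        types.add("typicality")
    if any(w in q for w in ("various", "different", "kinds of")):
        types.add("diversity")
    if any(w in q for w in ("for beginners", "simple", "explained")):
        types.add("comprehensibility")
    if any(w in q for w in ("example", "examples", "concrete")):
        types.add("concreteness")
    # fall back to plain query-document relevance
    return types or {"relevance"}
```

A detected type would then steer query processing, e.g. a "diversity" query triggers result diversification rather than plain relevance ranking.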
Flash SSD as Database Machine
Sang-Won Lee, Sungkyunkwan University
Flash memory-based SSDs are becoming popular due to their outstanding performance compared to conventional magnetic disk drives. Today, an SSD is essentially a block device attached to a legacy host interface. As a result, the system I/O bus remains a bottleneck, and the abundant flash memory bandwidth and the computing capabilities inside the SSD are largely untapped. In this talk, we propose to accelerate key database operations, scan and join, for large-scale data analysis by moving data-intensive processing from the host CPU to inside the flash SSD (“in-storage processing”), close to the data source itself. To realize the idea in a cost-effective manner, we deploy special-purpose compute modules using System-on-Chip technology. While data are transferred from flash memory, the target database operation is applied to the data stream on the fly without any delay. This drastically reduces the amount of data transferred to the host and, in turn, ensures that all components along the data path in the SSD are utilized in a balanced way. Our experimental results show that in-storage processing outperforms conventional processing with a host CPU by over 5 and 47 times for join and combined join-and-scan operations, respectively. It also turns out that in-storage processing can be realized at only 1% of the total SSD cost, while offering sizable energy savings of 45× compared to host processing.
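The core effect of in-storage processing, evaluating the predicate on the device so that only qualifying rows cross the host I/O bus, can be simulated in a few lines (a conceptual sketch, not the SoC implementation the talk describes):

```python
def in_storage_scan(table, predicate):
    """Simulate pushing a selection into the drive: every row is
    scanned on the device, but only matching rows are 'transferred'
    over the (bottlenecked) host I/O bus."""
    scanned = transferred = 0
    out = []
    for row in table:
        scanned += 1            # read from flash at full internal bandwidth
        if predicate(row):
            transferred += 1    # only this crosses the host interface
            out.append(row)
    return out, scanned, transferred
```

With a selective predicate, `transferred` is a small fraction of `scanned`, which is why the approach relieves the I/O-bus bottleneck for scan-heavy analysis.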
Towards Computational History through Large Scale Text Mining: Studies in Collective Memory
Adam Jatowt, Kyoto University
History helps us understand the past and the present, and even to predict the future to a certain extent. Given the huge amount of data about the past, we believe computer science will play an increasingly important role in historical studies, with computational history becoming an emerging interdisciplinary field of research. In this presentation, we show the results of a study of how the past is remembered, conducted through large-scale text mining. We collected a large dataset of news articles about different countries and analyzed the data using computational and statistical tools. We show that analyzing references to the past in news articles allows us to gain considerable insight into the collective memories and societal views of different countries. Our work demonstrates how various computational tools can assist us in studying history by revealing interesting topics and hidden correlations. Our ultimate objective is to enhance history writing and evaluation with the help of algorithmic support. In the talk, we outline our plans for continuing this research in computational history.
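A minimal version of "analyzing references to the past in news articles" is to count four-digit year mentions across a collection (a crude proxy for how strongly each past year is "remembered"; the study's actual analysis is far richer, this is only a sketch):

```python
import re
from collections import Counter

def past_references(articles, current_year=2011):
    """Count mentions of past years (1800-2099 pattern) in a corpus."""
    counts = Counter()
    for text in articles:
        for y in re.findall(r"\b(1[89]\d\d|20\d\d)\b", text):
            y = int(y)
            if y < current_year:    # keep references to the past only
                counts[y] += 1
    return counts
```

Aggregating such counts per country and plotting them over time is one way collective-memory curves of the kind described in the talk can be produced.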
SNQLSpatial+ (Sensor network query language)
Soon Joo Hyun, KAIST
Database management in wireless sensor networks requires features that differ from those of traditional database management. Data are collected from a massively large number of battery-operated sensor nodes in a wireless network, and a large amount of dynamically changing data must be temporally queried and collected from sensor nodes across multiple layers. In this talk, we introduce a new query language, SNQL, for present and future ubiquitous sensor networks. To minimize energy consumption, SNQL reduces unnecessary query executions by adapting query operations to dynamic environments. It also introduces a querying mechanism for controlling the quality of collected data in association with a node selection strategy. We show how the language constructs in SNQL achieve the intended querying efficiency.
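One common way to suppress unnecessary query executions on battery-powered nodes, offered here as an illustrative analogue of SNQL's adaptivity rather than its actual semantics, is delta-based reporting: a node transmits only when its reading drifts beyond a tolerance from the last reported value:

```python
class SensorNode:
    """Node that reports a reading only when it changes by more than
    `delta` since the last report, saving radio energy on stable data."""

    def __init__(self, delta):
        self.delta = delta
        self.last = None    # last value actually transmitted
        self.sent = 0       # number of transmissions so far

    def sample(self, value):
        if self.last is None or abs(value - self.last) > self.delta:
            self.last = value
            self.sent += 1
            return value    # transmitted to the base station
        return None         # suppressed: within tolerance
```

The tolerance `delta` is exactly the kind of knob a query language can expose to trade collected-data quality against energy, which matches the abstract's quality-control theme.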
Information Services Platform: Shifting Cloud from Collective Intelligence to Connective Intelligence
Koji Zettsu, NICT
Clouds have grown as playgrounds for collective intelligence, uncovered by collecting and mining huge amounts of data from individuals and societies with scale-out computing. The Information Services Platform project at NICT aims to open a new door to "connective" intelligence by building an ICT platform for aggregating, connecting, and extending cloud data and services. Target applications include trans-disciplinary data analysis for science clouds, and glocalization in participatory sensing and context aggregation for cyber-physical clouds. The R&D challenges lie in the fusion of network, database, and service computing technologies, such as service-controlled networking, cross-domain data warehousing, and data-intensive service computing.
Insight Search: a Challenge and Opportunity for Search in Web 2.0
Jaewoo Kang, Korea University
In this talk, I will introduce two ongoing search-related projects in the Data Mining Lab at Korea University. The first is a consensus search engine named CONSENTO. CONSENTO is designed to answer particularly challenging queries such as “best action movies in 2010” or “thrillers with plot twists.” For the first query, conventional search engines would simply return documents that contain the query words, such as “action movies” and “2010”, whereas what the user likely intended was a ranked list of popular action movies. Similarly, for the second query, conventional search engines would fail to return correct answers, as they simply return movies whose titles or descriptions match the keywords in the query. To address this problem, CONSENTO takes a drastic departure from conventional search engine techniques. First, in order to capture precise semantics from users’ opinions, CONSENTO employs subdocument-level indexing rather than conventional document-level indexing: it segments a document into multiple Maximal Coherent Semantic Units (MCSUs) and indexes them instead of the document. Second, in order to produce ranked results closer to the user’s intent, we introduce ConsensusRank, a new ranking method based on public sentiment. I will detail these two techniques in the talk. The second project is BOSS, a Biomedical Object Search System. BOSS answers queries such as “cancer suppressor genes” or “proteins associated with breast cancer.” Unlike conventional search engines, BOSS returns entities that match the query context, not documents that contain the query terms. BOSS indexes more than 20 million abstracts of academic articles in 28 different research categories. BOSS shares the same search engine architecture as CONSENTO, and I will explain how BOSS was implemented on top of CONSENTO.
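The two CONSENTO ingredients, subdocument-level units and a sentiment-based consensus score, can be sketched together. Everything here is a drastic simplification for illustration: the segmenter is a naive sentence splitter standing in for MCSU extraction, and the score (count of positive units mentioning an entity) is only a stand-in for ConsensusRank, whose actual definition the abstract does not give:

```python
import re

POSITIVE = {"best", "great", "brilliant"}   # toy sentiment lexicon

def segment(doc):
    """Naive sentence split, standing in for MCSU segmentation."""
    return [s.strip() for s in re.split(r"[.!?]", doc) if s.strip()]

def consensus_rank(docs, entities):
    """Rank entities by how many positive subdocument units mention
    them (an illustrative consensus-style score)."""
    score = {e: 0 for e in entities}
    for doc in docs:
        for unit in segment(doc):           # index units, not documents
            words = set(unit.lower().split())
            for e in entities:
                if e.lower() in words and words & POSITIVE:
                    score[e] += 1
    return sorted(entities, key=lambda e: -score[e])
```

Scoring at the unit level is what lets the positive opinion attach to the entity actually discussed in that unit, rather than to everything mentioned anywhere in the document.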