FULL PAPER International Journal of Recent Trends in Engineering, Vol 2, No. 1, November 2009
EVISTA – Interactive Visual Clustering System
K. Thangavel1, P. Alagambigai2
1 Department of Computer Science, Periyar University, Salem, Tamilnadu, India
2 Department of Computer Applications, Easwari Engineering College, Chennai, Tamilnadu, India
Abstract—Due to the enormous increase in data, exploring and analyzing it is increasingly important but difficult to achieve. Information visualization and visual data mining can help to deal with this. Visual data exploration has high potential, and many applications such as fraud detection and data mining will use information visualization technology for improved data analysis. The advantage of visual data exploration is that the user is directly involved in the data mining process. A large number of information visualization techniques have been developed over the last decade to support the exploration of large data sets. VISTA is an interactive visual cluster rendering system which invites the human into the clustering process, but it has some limitations in identifying the cluster distribution and in human-computer interaction. In this paper, we propose an Enhanced VISTA (EVISTA) which addresses these drawbacks. EVISTA improves the visualization in two ways: first, it uses weighted vector normalization instead of max-min normalization, which improves the data visualization such that the user can understand the underlying pattern without human intervention. Secondly, it completely eliminates the use of α tuning, which reduces the complexity of visual distance computation and eases the human-computer interaction. The experimental results show that EVISTA explores the underlying pattern of the dataset effectively and greatly reduces the user's operation burden.

Index Terms— Clustering, EVISTA, Human-computer interaction, Information visualization, Visual data mining.

I. INTRODUCTION

Data visualization is essential for understanding the concept of multidimensional spaces [5]. It allows the user to explore the data in different ways at different levels of abstraction to find the right level of detail. Therefore, techniques are most useful if they are highly interactive, permit direct manipulation and include a rapid response time. Visualization is defined by Ware as "a graphical representation of data or concepts" which is either an "internal construct of the mind" or an "external artifact supporting decision making". Visualization provides valuable assistance to the human by representing information visually. This assistance may be called cognitive support. Visualization can provide cognitive support through a number of mechanisms, such as grouping related information for easy search and access, representing large volumes of data in a small space, imposing structure on data and tasks to reduce time complexity, and allowing interactive exploration through manipulation of parameter values [11].

Visualization techniques could enhance current knowledge and data discovery methods by increasing user involvement in the interactive process. More recently there has been a lot of discussion on visualization for data mining. Visual data mining can be viewed as an integration of data visualization and data mining [5, 15]. Considering visualization as a supporting technology in data mining, four possible approaches are stated in [1]. The first approach is the use of visualization techniques to present the results obtained from mining the data in the database. The second approach applies data mining techniques to visualization by capturing essential semantics visually. The third approach uses visualization techniques to complement the data mining techniques. The fourth approach uses visualization techniques to steer the mining process.

In general, visualization can be used to explore data, to confirm a hypothesis or to manipulate a view. Exploratory visualization creates a dynamic scenario in which interaction is critical. The user does not necessarily know what he/she is looking for, can search for structures or trends, and is attempting to arrive at some hypothesis. In confirmatory visualization, the system parameters are often predetermined and the visualization tools are used to confirm or refute the hypothesis. Manipulative visualization focuses on refining the visualization to optimize the presentation. Visualization has been categorized into two major areas: i) scientific visualization, which focuses primarily on physical data such as the human body, etc.; ii) information visualization, which focuses on abstract nonphysical data such as text, hierarchies and statistical data. Data mining techniques are primarily oriented towards information visualization [4]. Both scientific visualization and information visualization create graphical models and visual representations from data that support direct user interaction for exploring and acquiring insight into useful information embedded in the underlying data [10, 15]. Even though visualization techniques have advantages over automatic methods, they bring up some specific problems such as limitation in visibility, visual bias due to the mapping of the dataset to a 2D/3D representation, easy-to-use visual interface operations and reliable human-computer interaction. In most visualization methods the human-computer interaction costs more than automated methods [9]. In general, visual data mining is different from scientific visualization and it has the following characteristics:
Wide range of users
Wide choice of visualization techniques and
Important dialog function.

2009 ACEEE DOI: 01.IJRTET.02.01.281
The users of scientific visualization are scientists and engineers who can tolerate some difficulty in using the system, whereas a visual data mining system must be easy enough for general users to use widely [16]. Considering this issue, this paper proposes a novel information visualization technique called the enhanced visual clustering system (EVISTA), an extended version of VISTA [8]. VISTA is a dynamic data visualization model which invites the human into the clustering process. Even though VISTA has proved to be an efficient interactive visual cluster rendering system, it requires complete user interaction throughout the clustering process. When the number of dimensions increases, the human-computer interaction becomes tedious. EVISTA is designed to provide an efficient data visualization such that the user is able to understand the underlying pattern of the given data set without human intervention.

The rest of the paper is organized as follows: Section 2 reviews related work in the domain of information visualization. Section 3 deals with EVISTA. Section 4 discusses the experimental analysis. Section 5 concludes the paper.

II. RELATED WORKS

Various efforts have been made to visualize multidimensional datasets [2, 10, 11, 13]. The early research on general plot-based data visualization is Grand Tour and Projection Pursuit [2]. The purpose of the Grand Tour and Projection Pursuit is to guide the user to find interesting projections.

L. Yang [2] utilizes the Grand Tour technique to show projections of datasets in an animation. They project the dimensions to coordinates in a 3D space. However, when the 3D space is shown on a 2D screen, some axes may be overlapped by other axes, which makes it hard to perform direct interactions on dimensions.

Star coordinate [7] is an interactive visualization model which treats dimensions uniformly, in which data are represented coarsely and by simple and more space-efficient points, which results in less cluttered visualization for large datasets.

Interactive visual clustering (IVC) [10] combines spring-embedded graph layout techniques with user interaction and constrained clustering.

VISTA [8, 9] is a recent visualization model that utilizes the star coordinate system and provides a mapping function similar to that of star coordinates. There are two types of cluster rendering in the VISTA model: unguided rendering and guided rendering.

III. ENHANCED VISUAL CLUSTERING SYSTEM

Enhanced VISTA (EVISTA) is an information visualization framework that employs improved data visualization and reveals the hidden patterns in complex high-dimensional data sets without human intervention. The EVISTA model is designed based on star coordinates.

The star coordinate system is a traditional multivariate data visualization technique in which the k-axis is defined by an origin $O = (x, y)$ and k coordinates $S_1, S_2, \ldots, S_k$, which represent the k dimensions in 2D space. The k coordinates are equidistantly distributed on the circumference of the circle C, where the unit vectors are

$$S_i = \left(\cos\frac{2\pi i}{k},\ \sin\frac{2\pi i}{k}\right), \quad i = 1, 2, \ldots, k$$

and the 2D point $Q(x, y)$ is obtained by

$$Q(x, y) = \left\{ \sum_{i=1}^{k} x_i' \cos\frac{2\pi i}{k},\ \sum_{i=1}^{k} x_i' \sin\frac{2\pi i}{k} \right\}$$

where $x_i$ represents the given data object and $x_i'$ represents the normalized data value based on weighted vector normalization.

[Equation defining $x_i'$ in terms of the weight $w_t$: not recoverable from the source.]

EVISTA employs the design of the VISTA visual cluster rendering proposed by Keke Chen and L. Liu [8], which provides an intuitive way to visualize clusters with interactive feedback to encourage domain experts to participate in the clustering revision and cluster validation process. It allows the user to interactively observe potential clusters in a series of continuously changing visualizations through α. More importantly, it can include algorithmic clustering results and serve as an effective validation and refinement tool for irregularly shaped clusters [9]. The VISTA system has two unique features. First, it implements a linear and reliable visualization model to interactively visualize multi-dimensional datasets in a 2D star-coordinate space. Second, it provides a rich set of user-friendly interactive rendering operations, allowing users to validate and refine the cluster structure based on their visual experience as well as their domain knowledge.

The VISTA visualization model consists of two linear mappings: max-min normalization followed by α-mapping. Equation (5) represents the max-min normalization, which is used to normalize the columns in the dataset so as to eliminate the dominating effect of large-valued columns.

$$v' = \frac{2(v - \min)}{\max - \min} - 1 \qquad (5)$$

where $v$ is the original value and $v'$ is the normalized value. The α-mapping maps k-dimensional points onto the two-dimensional visual space with the convenience of visual parameter tuning.

The proposed visualization model EVISTA utilizes weighted vector normalization, which is performed on rows instead of columns, such that the visualization model defines
the reliable position of $Q(x, y)$. EVISTA completely eliminates the use of α tuning, since α-mapping is tedious when the number of dimensions is high, and each change in the α values requires a fresh visual distance computation. As the number of dimensions increases, the visual distance computation adds to the time complexity. Similar effects may occur when the number of data objects increases. This makes the human-computer interaction ineffective and affects the applicability of VISTA.

IV. EXPERIMENTAL ANALYSIS

To illustrate the efficiency of our proposed visualization, empirical analyses are conducted on a number of benchmark data sets available in the UCI machine learning data repository. The performance of EVISTA is compared against the VISTA system and the automatic clustering algorithm K-Means. The experiments in VISTA are conducted by setting the α value to 1. The detailed information of the data sets is shown in Table I.

TABLE I. DETAILS OF DATASETS
[Table body largely lost in extraction. Surviving column headers: Attributes, Classes. Surviving row fragment: 10, 2, 699.]

[...] of clusters is very important in cluster analysis, because clustering methods tend to generate clusterings even for fairly homogeneous datasets. The quality of clusters obtained through visual clustering is measured in terms of three classical methods proposed in [3]. The Rand index and Jaccard coefficient validations are based on the agreement between clustering results and the "ground truth". The classical validity measures are heavily related to the geometry or density nature of clusters and they do not work well for arbitrarily shaped clusters [8]. In such cases, visual perception plays an important role in deciding the right clusters.

B. Results and Discussion

Iris Data: The Iris dataset is a benchmark dataset widely used in pattern recognition and clustering. It consists of 150 four-dimensional instances of three classes of plants, classified according to sepal length and width and petal length and width. The iris dataset consists of three clusters with equal distribution. One cluster is linearly separable from the other two; the latter two are not exactly linearly separable from each other. Figure 1 shows the initial visualization of the iris dataset in the VISTA model, where we observe the possibility of three clusters. It is observed from the figure that one cluster is completely separated from the other two, whereas the remaining two overlap. After performing interactive visual clustering with suitable α tuning, the visual boundaries between the clusters become clearer. Figure 2 shows the visualization of the iris dataset after α tuning. As the literature on the iris dataset specifies, the two clusters are not linearly separable. In VISTA this could be observed after fine tuning of α. The small region consisting of the overlapping data points is also observed. More importantly, the separation of the two clusters is found to be difficult for users.

Figure 1. Visualization of Iris Dataset using VISTA system

Figure 2. Visualization of Iris Dataset after α-tuning using VISTA system

In VISTA, domain knowledge plays a vital role in finding the optimum number of clusters. In general, domain knowledge in the form of labeled items obtained by traditional automatic clustering algorithms such as K-Means can be incorporated into the visual clustering process. A user without domain knowledge may fail to find the optimum clusters, since α tuning changes the data point distribution. Most automated clustering algorithms require the number of clusters to be specified in advance, which may not coincide with the real cluster distribution of the dataset. This increases the complexity of the clustering process. EVISTA reduces the complexity of clustering by eliminating the use of α. Figure 3 shows the iris dataset visualization based on EVISTA.

Figure 3. Visualization of Iris Dataset using EVISTA system

From the results, it is observed that one cluster is completely separated from the others and the visual boundaries between the other two clusters are clearly identified. It is also noticed that only two data points overlap. Since EVISTA does not use α tuning, the visual distance computation is completely eliminated, which reduces the time complexity. EVISTA does not require domain knowledge in any form, which eases the human-computer interaction, and it visualizes the exact pattern of the given dataset without human intervention.

Australian Data: The Australian dataset concerns credit card applications. This dataset is interesting because there is a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. This data
set also has missing values. Suitable statistical computation is applied to fill in the missing values. It has two classes. The class distribution is 44.5% for class A and 55.5% for class B.

Figure 4 shows the visualization of the Australian data set in VISTA, where possibly one single cluster is observed. During α tuning, the user is able to identify the two clusters. If the α tuning is not performed carefully, the user may get a different pattern, which may lead to confusion. Figure 5 shows the process of α tuning, where a four-cluster distribution is observed. This leads to poor cluster quality. In such a case, domain knowledge is the only aid to identify the optimum number of clusters. Figure 6 shows the cluster distribution using EVISTA, where two potential clusters are observed. Since α tuning is not included in the EVISTA model, the cluster distribution can be clearly visualized. Even though the user does not have domain knowledge in any form, such as the number of clusters or the cluster distribution, the EVISTA visualization model suitably identifies the optimum number of clusters.

Figure 4. Visualization of Australian Dataset using VISTA system

Figure 5. Visualization of Australian Dataset using VISTA system with α-tuning

Figure 6. Visualization of Australian Dataset using EVISTA

Pima Data: The Pima dataset is an Indian diabetes database with 768 data objects. It has two classes with a class distribution of 500 and 268. It consists of attributes such as number of times pregnant, plasma glucose concentration, diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), diabetes pedigree function, etc. Figure 7 shows the VISTA visualization of the Pima Indian dataset. When the Pima dataset is visualized using VISTA, one possible cluster is observed. Even suitable α tuning does not distinguish the clusters. The boundary regions of the two clusters may not be identifiable.

In contrast, the EVISTA visualization of the Pima dataset clearly shows two potential clusters. From Fig. 8 it is observed that the Pima dataset contains two potential clusters, and a few data objects are scattered around the potential area. Since EVISTA does not require α tuning, the user may find it very flexible in finding the underlying pattern of the dataset without human intervention. With suitable geometric transformations such as scaling and rotation, the user may be able to observe the cluster distribution according to their visual perception.

Figure 7. Visualization of Pima Dataset using VISTA system

Figure 8. Visualization of Pima Dataset using EVISTA system

C. Comparative Analysis

This part of the section compares the results of EVISTA with VISTA and the centroid-based automatic clustering algorithm K-Means. In EVISTA the cluster labeling is performed using free-hand drawing. The area with potential data points is covered by a convex hull, and the data points in the convex hull are labeled as one single cluster. The cluster results, evaluated based on the Rand index and Jaccard coefficient, are shown in Table II and Table III. The results of VISTA are obtained by conducting the experiments over several runs, and the average of them is taken for the experimental analysis.

V. CONCLUSION

With the development of data collection technology, effective data visualization models are required to understand the patterns of multidimensional and multivariate data. In this paper, Enhanced VISTA is proposed to improve data visualization. EVISTA is designed with weighted vector normalization, which improves the data exploration. The elimination of α tuning in the visualization process reduces the complexity of human-computer interaction. More importantly, EVISTA does not require domain knowledge in any form, which improves the applicability of EVISTA. The experimental results show that EVISTA efficiently identifies the cluster distribution and reduces the complexity of the visual distance computation. Specifically, it eases the human-computer interaction.
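As a supplementary illustration of Section III, the VISTA-style mapping (column-wise max-min normalization, Equation (5), followed by projection onto equidistant star-coordinate axes) can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, Equation (5) is taken in its [-1, 1] form, the optional per-axis `alpha` weights stand in for the α-mapping (all 1 by default, matching the α = 1 setting used in the VISTA experiments), and the sample rows are toy values. EVISTA's row-wise weighted vector normalization is not shown because its formula is not given in the text.

```python
import math


def max_min_normalize(rows):
    """Column-wise max-min normalization to [-1, 1]:
    v' = 2 * (v - min) / (max - min) - 1 (assumed form of Equation (5))."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        new_row = []
        for v, lo, hi in zip(row, lows, highs):
            # Assumption: a constant column carries no information, map it to 0.
            new_row.append(0.0 if hi == lo else 2.0 * (v - lo) / (hi - lo) - 1.0)
        normalized.append(new_row)
    return normalized


def star_axes(k):
    """Unit vectors S_i = (cos(2*pi*i/k), sin(2*pi*i/k)) for i = 1..k."""
    return [(math.cos(2 * math.pi * i / k), math.sin(2 * math.pi * i / k))
            for i in range(1, k + 1)]


def project(row, alpha=None):
    """Map one normalized k-dimensional row to a 2D point Q(x, y).
    `alpha` is a hypothetical per-axis weight list (defaults to all ones)."""
    k = len(row)
    weights = [1.0] * k if alpha is None else alpha
    qx = qy = 0.0
    for w, value, (sx, sy) in zip(weights, row, star_axes(k)):
        qx += w * value * sx
        qy += w * value * sy
    return qx, qy


# Toy four-dimensional rows (illustrative values only).
data = [[5.1, 3.5, 1.4, 0.2],
        [7.0, 3.2, 4.7, 1.4],
        [6.3, 3.3, 6.0, 2.5]]
points = [project(row) for row in max_min_normalize(data)]
for point in points:
    print(point)
```

Changing an entry of `alpha` moves every projected point, which is why each α adjustment in VISTA requires the visual positions to be recomputed.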
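The Rand index and Jaccard coefficient used in the comparative analysis are pair-counting measures of agreement between a clustering result and the ground truth. The following Python sketch assumes the standard pair-counting definitions (the paper cites [3] for the measures but does not restate the formulas); the function name `pair_counting_scores` and the toy label lists are illustrative only.

```python
from itertools import combinations


def pair_counting_scores(predicted, truth):
    """Return (rand_index, jaccard) for two equal-length label sequences."""
    if len(predicted) != len(truth):
        raise ValueError("label sequences must have the same length")
    a = b = c = d = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            a += 1   # pair grouped together in both
        elif same_pred:
            b += 1   # together only in the clustering result
        elif same_true:
            c += 1   # together only in the ground truth
        else:
            d += 1   # separated in both
    total = a + b + c + d
    rand = (a + d) / total if total else 1.0
    jaccard = a / (a + b + c) if (a + b + c) else 1.0
    return rand, jaccard


# Toy example: six objects, two true classes, one object mislabeled.
truth = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1]
rand, jaccard = pair_counting_scores(predicted, truth)
print(f"Rand index = {rand:.3f}, Jaccard = {jaccard:.3f}")
```

Both scores lie in [0, 1]; multiplying by 100 gives percentages of the kind reported in the tables. The Jaccard coefficient ignores pairs separated in both partitions, so it is typically lower than the Rand index on the same result.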
TABLE II. COMPARISON OF EVISTA WITH VISTA AND K-MEANS BASED ON RAND INDEX

TABLE III. COMPARISON OF EVISTA WITH VISTA AND K-MEANS BASED ON JACCARD COEFFICIENT

[Table bodies largely lost in extraction. Surviving column headers: Visual Clustering (Without α tuning, With α tuning), K-Means. Surviving value fragments, in source order: 50.58; Australian 63.46 68.00; 45.84; Australian 48.82.]

Figure 9. Comparison based on Rand Index

Figure 10. Comparison based on Jaccard coefficients

REFERENCES

[1] Bhavani Thuraisingham, "Data Mining: Technologies, Techniques, Tools and Trends", CRC Press, London, New York, Washington, 1999.
[2] Cook, D.R., Buja, A., Cabrea, J., and Harley, H., "Grand Tour and Projection Pursuit", J. Computational and Graphical Statistics, v23, (1995).
[3] Daxin Jiang, Chun Tang, Aidong Zhang, "Cluster analysis for gene expression data: a survey", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11, 2004.
[4] Daniel, Keim, A., and Hans-Peter (1996), "Visualization Techniques for Mining Large Databases: A Comparison", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No.
[5] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, August 2000, ISBN 1-55860-489-8.
[6] A. K. Jain, M. N. Murty and P. J. Flynn, "Data clustering: A Review", ACM Computing Surveys, 1999.
[7] E. Kandogan, "Visualizing Multi-dimensional Clusters, Trends and Outliers using Star Co-ordinates", Proc. of ACM KDD, 2001.
[8] Keke Chen and L. Liu, "VISTA: Validating and Refining Clusters via Visualization", Information Visualization, Vol. 3, 4.
[9] Keke Chen and L. Liu, "iVIBRATE: Interactive Visualization-Based Framework for Clustering Large Datasets", ACM Transactions on Information Systems, Vol. 24, April 2006.
[10] Marie desJardins, James MacGlashan, Julia Ferraioli, "Interactive visual clustering", Intelligent User Interfaces 2007.
[11] Melanie Tory and Torsten Moller, "Human Factors in Visualization Research", IEEE Transactions on Visualization and Computer Graphics, 10(1), 2004.
[12] Pang-ning Tan, Michael Steinbach and Vipin Kumar, "Introduction to Data Mining", Pearson Addison Wesley, Boston, 2006.
[13] O. Sourina, D. Liu, "Visual interactive 3-dimensional clustering with implicit functions", Proceedings of the IEEE Conference on Cybernetics and Intelligent Systems, Volume 1, 1-3 Dec 2004, pp. 382-386.
[14] Thangavel. K and Ashok Kumar. D, "Optimization of code book in Vector Quantization", International Journal Annals of Operations Research, Vol. 143, No. 1, 317-325, 2006.
[15] Ye N., "The Hand Book of Data Mining", Lawrence Erlbaum Associates, Publishers, Mahwah, New Jersey, 2003.
[16] Zhen Liu, Shinichi Kamohara, Minyi Guo, "A Scheme of Interactive Data Mining Support System in Parallel and Distributed Environment", ISPA 2003, LNCS 2745, Springer-Verlag, pp. 263-272, 2003.
The first author expresses his thanks to the University Grants Commission for financial support (F-No. 34-105/2008, SR).
Source: http://searchdl.org/public/journals/2009/IJRTET/2/1/281.pdf