Discovering Local Patterns for Concept Description
Martin Atzmueller, Florian Lemmerich, Beate Krause, and Andreas Hotho
{atzmueller, lemmerich, krause, hotho}@informatik.uni-wuerzburg.de
Abstract. Concept description is an important task of descriptive data mining:Basically, its aim is to identify and to summarize properties of a selected targetpopulation in the form of a set of patterns – in a concise and comprehensibleway. In this paper we present an approach for concept description in the socialbookmarking domain: We show how subgroup discovery can be utilized for iden-tifying discriminative and characteristic local patterns in order to understand thebehavior of (non-)spammers. A case study applying data from a real-world sys-tem for social bookmarking provides exemplary results and demonstrates the ap-plicability and effectiveness of the presented approach.
Concept description or class description, e.g., [1], is an important method used in de-scriptive data mining that comprises two subtasks: Concept characterization aims tosummarize a given target population in terms of typical or characteristic features. Incontrast, concept/class discrimination generates descriptions comparing the target pop-ulation to one or more contrasting populations. In this way, both techniques aim todescribe the target population in complementing ways: Concept discrimination focuseson the differences between classes, by contrasting their discriminating features. On theother hand, concept characterization focuses on the common or typical features of acertain class. As we will see later, there is a trade-off between the characterization anddiscrimination goals, when considering the respective patterns.
Subgroup discovery [2, 3], is a broadly applicable technique for identifying prop-
erties of a selected target population: It is usually applied for data exploration and de-scriptive induction, in order to identify relations between a dependent (target) variableand usually many independent variables, e.g., consider the subgroups "users which takea longer time till their first post and have no middle name indicate spammers [targetvariable]" or accordingly "users with a low number of tags and an IP in range X areusually non-spammers". Due to its flexible discovery strategy applying an arbitrary anduser-definable quality function, subgroup discovery is easily adaptable for discoveringlocal patterns in the context outlined above. While subgroup discovery is commonlyapplied for concept/class discrimination, by identifying discriminating descriptions ofsubgroups with a (significantly) deviating distribution of the target (class), we will showhow to extend and adapt the method for concept characterization.
In contrast to existing approaches for concept description (e.g., [4, Ch. 4.3]), sub-
group discovery provides the following distinctive features: It can cope with a largenumber of relevant attributes and its search strategy is goal-oriented applying an arbi-trary quality function that can be flexibly defined by the user.
Since subgroup discovery methods are not necessarily covering approaches, sev-
eral of the discovered patterns can cover the same set of instances. In order to presenta selected discriminative high-quality set of subgroups, the patterns should in generalhave an individual high quality, and a low overlap with respect to other competing pat-terns. Thus, the result set of the discovered subgroups can often be reduced by removingirrelevant subgroups [5, 6].
In this paper, we present an approach for concept characterization and discrimi-
nation using local patterns that are obtained using subgroup discovery techniques. Wedescribe the general approach, and also discuss the relation of concept description andinformation retrieval, since in many applications (such as in the case study of this paper)the prominent evaluation metrics come from this context. Since a characterization taskaims to cover the target group as comprehensibly as possible (recall) and a discrimina-tion provides distinctive features of the target population (precision), both requirementscan be captured by the applied method.
The context of the proposed approach is the description of (non-)spammers, i.e.,
characterizing spammers and non-spammers by using their distinctive features but alsotheir common features (with respect to the respective (non-)spammer class). We showhow local patterns for concept description can be retrieved, and we present a case studyapplying real-world data from the social bookmarking system BibSonomy [7].
The rest of the paper is structured as follows: Section 2 provides the necessary
background, introducing subgroup discovery, and the method for removing irrelevantlocal patterns. Next, Section 3 describes the approach for discovering local patterns forconcept characterization and discrimination in detail. After that, Section 4 presents acase study of the proposed approach in the BibSonomy domain. Section 5 discussesrelated work. Finally, Section 6 concludes the paper with a summary and interestingdirections for future research.
In the following, we first introduce the necessary notions concerning the used knowl-edge representation, before we define the subgroup discovery setting, and describe amethod for removing irrelevant patterns.
Let ΩA be the set of all attributes. For each attribute a ∈ ΩA a range dom(a) of valuesis defined. Furthermore, we assume VA to be the (universal) set of attribute values ofthe form (a = v), where a ∈ ΩA is an attribute and v ∈ dom(a) is an assignable value. We consider nominal attributes only so that numeric attributes need to be discretizedaccordingly.
Let CB be the case base (data set) containing all available cases, also often called
instances. A case c ∈ CB is given by the n-tuple c = ((a1 = v1), (a2 = v2), . . . , (an =vn)) of n = |ΩA| attribute values, vi ∈ dom(ai) for each ai.
The main application areas of subgroup discovery are exploration and descriptive in-duction, to obtain an overview of the relations between a (dependent) target variableand a set of explaining (independent) variables. Then, the goal is to uncover proper-ties of the selected target population of individuals featuring the given target propertyof interest. Therefore, not necessarily complete relations but also partial relations, i.e.,(small) subgroups with "interesting" characteristics can be sufficient.
A subgroup discovery task mainly relies on the following four main properties: the
target variable, the subgroup description language, the quality function, and the searchstrategy. An efficient search strategy is necessary since the search space is exponentialconcerning all the possible selectors of a subgroup description. Often, heuristic beamsearch methods but also efficient exhaustive algorithms, e.g., [8], are applied.
In this paper we focus on binary target variables. For multiple classes/concepts we
can equivalently generate multiple binary class-problems. Similar to the MIDOS ap-proach [3] we try to identify subgroups that are, e.g., as large as possible, and have themost unusual (distributional) characteristics with respect to a given concept of interestrepresented by the target variable. The description language specifies the individualsbelonging to the subgroup. The subgroup is thus given by all cases in the data set thatsatisfy its subgroup description. For a commonly applied single-relational propositionallanguage a subgroup description can be defined as follows:
Definition 1 (Subgroup Description). A subgroup description sd = {e1, e2, . . . , en}is defined by the conjunction of a set of selection expressions (selectors). The individualselectors ei = (ai, Vi) are selections on domains of attributes, ai ∈ ΩA, Vi ⊆ dom(ai). We define ΩE as the set of all selection expressions and Ωsd as the set of all possiblesubgroup descriptions.
A quality function measures the interestingness of the subgroup and is used to rank
these. Typical quality criteria include the difference in the distribution of the targetvariable concerning the subgroup and the general population, and the subgroup size.
Definition 2 (Quality Function). Given a particular target variable t ∈ ΩE, a qual-ity function q : Ωsd × ΩE → R is used in order to evaluate a subgroup descriptionsd ∈ Ωsd, and to rank the discovered subgroups during search.
In comparison to the strict support/confidence framework applied for association
rule mining, e.g., [9], the subgroup quality functions can be flexibly defined whichprovides for a powerful tool. Typical quality functions, cf., [2], include the deviation ofthe target share in the subgroup compared to the general population. For binary targetvariables, examples for quality functions are given by
where p is the relative frequency of the target variable in the subgroup, p0 is the relativefrequency of the target variable in the total population, N = |CB | is the size of the totalpopulation, and n denotes the size of the subgroup.
In contrast to the quality function qBT (the classic binomial test), the quality func-
tion qRG only compares the target shares of the subgroup and the total population mea-suring the relative gain. Therefore, a support threshold TSupp is necessary to discoversignificant subgroups.
In addition to the parameters discussed above, other parameters, like the simplicity
of the patterns can be flexibly included into the quality function. Below we will describeformalizations and adaptations of quality functions for concept description.
The result of subgroup discovery is a set of subgroups. Since subgroups can overlap,relevancy analysis is essential in order to identify a compact but comprehensive setof subgroups. As proposed in [10], we consider a subgroup s as more relevant withrespect to another subgroup description s , if it covers all positive examples of s , butno negative examples, that are not covered by s as well. Formally s is irrelevant withrespect to s, if and only if
TP (s ) ⊆ TP (s) and FP (s) ⊆ FP (s ),
where TP (s) is the set of positive examples (containing the target concept) in the sub-group s and FP (s) = s\TP (s) denotes the set of negative examples contained in s.
To identify relevant subgroups we utilize efficient methods for mining relevant pat-
terns as described in [5], based on an efficient vertical bitset-driven subgroup discoveryalgorithm, and the definition of relevant/irrelevant patterns for class-labeled data [10,11]. The applied approach allows for the efficient and effective filtering of irrelevantpatterns during the subgroup discovery process in order to obtain a more diverse set ofsubgroups and can also be adapted to incorporate the handling of exceptions into thestrict relevancy definition.
Local Patterns for Concept Characterization and Discrimination
Concept description is one of the major tasks of descriptive data mining: Also oftenreferred to as class description, it aims to describe a set of individuals in a conciseand compact way, cf., [1]. In contrast to common descriptive data mining tasks such asassociation rule mining, e.g., [9], concept description and subgroup discovery can bothbe regarded as supervised learning tasks since they focus on a specific concept/class ofinterest, or generally on a certain property that we are interested in.
The concept description task can be focused on characterization, that is, discovering
typical properties of the target concept, or on discrimination, that is, identifying theproperties discriminating between the target concept and the non-target elements. Wediscuss these subtasks in detail below, and we show, how both can be mapped to asubgroup discovery task using suitable quality functions. Before that, we define patternrules as a convenient representation formalism for subgroups and subgroup patterns,respectively.
Subgroup patterns, i.e., subgroups with their associated subgroup descriptions can berepresented similar to rules containing the subgroup description in the body of the rule,and the target concept in the head of the rule. Since a subgroup is ranked by a qualityfunction with respect to a specific target property, it can directly be assigned a qualityrating, e.g., according to the value of the quality function. In the following, we definepattern rules, i.e., rules for formalizing subgroup patterns in a rule-like manner and anassociated quantitative quality rating.
Definition 3 (Pattern Rule). A pattern rule r = B(r) → H(r) [q(r)] is defined by thebody B(r) and the head H(r) of the pattern rule, where B(r) ⊆ ΩE, H(r) ⊆ ΩE forwhich the selectors are combined conjunctively, i.e., e1∧· · ·∧ek, (ei ∈ ΩE, i = 1 . . . k). A quality parameter q(r) ∈ R is assigned to the pattern r denoting its respective quality.
Concept Characterization using Subgroup Patterns
The goal of concept characterization is to provide a concise and succinct summary ofa given target concept: By identifying characteristics of a selected target populationtypical properties of the concept can be identified. Thus, the focus is on typical, ornecessary features that occur for (almost) all objects of the concept. In that context,a necessary feature f ∈ Ωsd means, that if target concept t ∈ ΩE is observed, thenf is also observed, i.e., t → f . So, essentially necessary features are contained inall cases of the target concept. However, this usually only happens for few interestingfeatures. Therefore, we will relax this condition for near-necessary features: We applya quality measure qc for expressing the typicality of each pattern with respect to thetarget concept. Then, we can propose a characteristic pattern rule r = t → f [qc(r)]for the pattern that specifies the typicality of the pattern using the quality measure qc.
Concept Discrimination using Subgroup Patterns
In contrast to concept characterization the aim of concept discrimination is to compareor to contrast a given target concept with one or more discriminating concepts. In orderto obtain a comparative summary of the concept the distributions of the target and thecontrasting concepts are compared in the different subpopulations. A perfect contrastis then given by sufficient properties or features: If feature f ∈ Ωsd is observed thenthe target class t ∈ ΩE is also observed, i.e., f → t. So, such a rule also provides forcertain classification given the feature. However, we need to relax this condition usingnear-sufficient features: Similar to concept characterization the discriminating patternsare also assigned a quality parameter based on their discriminative power, which can beused for formalizing discriminative pattern rules.
It is easy to see that subgroup discovery can be naturally applied for concept dis-
crimination, since we consider pattern of the type Body → T in that case: We cansimply map a given subgroup description sd = {e1, e2, . . . , en} to a discriminativepattern rule r = e1 ∧ e2 · · · ∧ en → t [q(r)], with respect to the target concept t. Then,the quality parameter q(r) is determined by the applied quality function qd.
Quality Functions for Concept Description
In the following we describe the relevant parameters of a subgroup pattern. Further-more, we discuss how the different quality functions can be applied both for conceptcharacterization and concept discrimination. In a discriminative setting, let us considera subgroup s, and its equivalent pattern rule r = sd → t [q(r)] with the subgroup de-scription sd of s, and the target variable t. We construct two binary variables T and SDfor the target class cases, and the cases covered by the subgroup description, respec-tively. We can then create a four-fold contingency-table as shown below.
Considering the possible outcomes of the rules, we distinguish the true positives
(tp) for which the pattern correctly predicts the target variable, the false positives (fp)for which the prediction is incorrect, and equivalently the false negatives (fn) and truenegatives for the ‘negation’ of the rule, i.e., for the complement of the prediction. For apattern rule r = t → sd [qc(r )], i.e., a rule that contains the target concept in the bodyof the rule, the entries for fp and fn are simply swapped in the contingency table.
So, the subgroup discovery approach can in principle be directly applied to both
settings, since we just need to insert the right parameters into the quality functions:For the discriminative setting the target share p and the subgroup size n can be easilyobtained using the parameters contained in the contingency table, since p = tp/(tp +fp) and n = tp + fp. Note that this is equivalent to the precision of the pattern knownfrom information retrieval, e.g., [12]. Furthermore, for concept characterization, weobtain an adapted p = tp/(tp + fn) and n = tp + fn since the ‘reference subgroup’consists of all the positives of the total population. This is equivalent to the recall ofthe pattern known from information retrieval, if we consider the subgroup descriptionas the ‘query’ and the instances covered by the subgroup as the result set.
For the discriminative setting we can readily apply all of the usual quality functions
used for subgroup discovery. We can consider the relative gain quality function qRG,for example, which is order-equivalent to precision discussed above. Then, we estimatethe deviation of the distribution of the target concept in the subgroup compared to thegeneral population.
For the characteristic setting, we need to apply special quality functions since we
mainly want to characterize a selected target population, i.e., all the positives of thetarget concept. Therefore, we focus on the positive coverage, i.e., on the coverage ofthe target space, in contrast to discriminative quality functions that take the size of thesubgroup into account corresponding to the coverage of the total population. Thus, forthe characterization task, the distribution of the true positives tp needs to be comparedto the total positives which can be obtained as Pos = tp + fn. For this purpose, forexample, the quality function qT P R measuring the true positive rate (or equivalentlythe recall) can be applied,
which compares the true positives of the subgroup to all positives with respect to thetarget variable. This quality function is equivalent to the function proposed by Han [4,Ch. 4.3], and can be used for estimating the typicality of the subgroup with respectto the target population. Then, subgroups with a large overlap with the target classinstances are selected, without considering the (potentially large) overlap with the non-target instances.
Concept description also concerns the understandability or comprehensibility of the
patterns. Therefore, also the simplicity of the (rule) patterns is a major concern, cf., [13,14]; the syntactic simplicity can be estimated using the length of the descriptions of thepatterns. In order to incorporate this parameter into the applied quality measures q∗, wesimple take the fraction of the original quality function value q(s) of the subgroup sand the length |sd(s)| of the subgroup description sd (s):
In this way, patterns that are described by shorter (and thus simpler) descriptions willbe favored. It is easy to see, that this is especially useful in the case of breaking tiesbetween sets of subgroups with an equal quality ranking.
As discussed above, we can directly apply subgroup discovery for both subtasks of con-cept description, that is, concept characterization and discrimination by choosing suit-able quality functions. These provide different views on the concept description task,because the analyst can focus on the characterization and on the discrimination aspect. However, in order to take both into account, we need to consider both the character-ization and the discrimination, because for description we want to obtain 1) a largecoverage of the target concept, and 2) a significant deviation of the target distributionwithin the subgroup pattern and the whole data set. Essentially, the analyst can navi-gate between the characteristic and discriminative patterns easily, e.g., by tweaking thethreshold parameters, by modifying the set of analyzed attributes, or by adapting theapplied quality function.
However, for a quick comprehensive view there is also an integrated option: As
discussed above, the quality functions for characterization are similar to measuring therecall of the pattern, while the discriminative setting provides pattern with a high pre-cision. Analogously to information retrieval, we can therefore combine quality func-tions for characterization (qc) and for discrimination (qd) using an adapted F-Measure,e.g., [12], resulting in the harmonic mean between qc and qd applying normalized qual-ity functions with a value range in [0; 1]
Equivalently to the F-Measure used in information retrieval, F (qc, qd) now mea-
sures the effectiveness of the concept description with respect to a user who attachesβ times as much importance to qc (characterization) as qd (discrimination); for the
F-Measure β = 1, for equally weighting characterization and discrimination. The βparameter provides for a convenient option for adaptations, and for shifting the focusbetween characteristic and discriminative patterns.
In summary, we can view the concept description task using local patterns from
different perspectives: From the characterization view, from the discrimination viewand from a combined view using the F-Measure. We will discuss several practical op-tions concerning concept characterization and discrimination in the next section whenintroducing the setup and results of the case study.
Case Study – Characterizing (Non-)Spammers
In this section, we discuss the results of the presented approach using a case study in thesocial bookmarking domain. We first introduce the context of the social bookmarkingsystem BibSonomy. Next, we describe the applied data set and discuss the results of theapplication of the presented approach.
Resource sharing systems like BibSonomy provide an easy way to organize and managedifferent kinds of resources. What is considered as a resource depends on the type ofsystem. For instance, in del.icio.us, the resources are URLs, in flickr, the resources arepictures, and in BibSonomy they are either URLs or publication entries. At their core,these systems are all very similar. Once a user is logged in, he can add a resource to thesystem, and assign arbitrary tags to it. The collection of all assignments of a user formthe personomy, the collection of all personomies constitutes the folksonomy. As in othersystems, the user can explore personomies of arbitrary users in all dimensions: For agiven user one can see all resources that have been uploaded, together with the tags thatwere assigned to them; when clicking on a resource one can see which other users haveuploaded this specific resource and how they tagged it; and when clicking on a tag onecan see who assigned it to which resources [7]. Overall, these systems provide a veryintuitive navigation through the data.
In [15] we proposed 25 features in four categories to describe the users and their be-haviour in BibSonomy. For our experiments we focused on the attributes and the cate-gories that are not derived using information about spammers and non-spammers andcan therefore be regarded as ’non-semantic’ socio-demographic features. Therefore, weselected the 15 most interesting attributes (features) from three categories: Profile fea-tures, location-based features, and activity-based features. We shortly repeat these inthis section for the convenience of the reader.
Profile features Table 1 shows profile features that are extracted from the profile of auser which he or she reveals when requesting an account in BibSonomy. Most of the
realnamedigit realname contains digitsrealname2
fields to fill in at registration are not obligatory, however, users need to indicate at least aname and a valid e-mail-address. In contrast to normal users, spammers often use namesor e-mail addresses with many numbers. For instance, typical names of spammers are“styris888” and “painrelief2”. The spam/non-spam distribution of the number of digitsin the username, real name and the email address (namedigit, realnamedigit, maildigit)is very different. The namelen, maillen and realnamelen features refer to the lengthof the usernames and realnames in terms of characters. The realname2 and realname3features are binary and set to one, if the user has indicated two or three names (legitimateusers often register with their full names).
Location based features Location based features refer to describing the user’s locationand domain. Table 2 summarizes the location based features.
Often, the same spammer uses several accounts to publish the same content. These
accounts show the same IP address when they are registered. Thus, if one user with aspecific IP or uses a specific domain is already marked as a spammer, the probabilitythat other users with the same IP or domain are also spammers is higher. When consid-ering the users in the training dataset, we observed this for users of specific domainsand created therefor the features (domaincount, tldcount). The probability that a userwho is from a rare domain which hosts many spammers is also a spammer is higherthan average (and vice versa). For instance, 16 users have registered with the domain“spambob.com” and 137 with the domain “rhinowebmail”, all of which were classifiedas spammers.
domaincount number of users in the same domaintldcount
number of users in the same top level domain
Activity based features Activity properties (Table 3) consider different kinds of userinteraction with the social bookmarking system. While normal users tend to interactwith the system instantly after their registration (e. g., by posting a bookmark), spamusers often wait a certain time after they submit their first post. This timelag can beconsidered when characterizing spam (datediff ).
There are other ‘simple’ properties which we found when manually cleaning the
system from spam. For instance, ‘$group=public’ is added by many spammers, sincethis specific tag is used by a software to generate spam in social bookmarking systems(grouppub). Furthermore, the number of tags per post often varies (tasperpost). Spam-
difference between registration and first post
number of times ’$group=public’ was used
number of total tags added to all posts of this account
mers usually add many different tags to a resource, either to show up more often whensearching for many different tags, or to include ’good’ tags in order to confuse spamdetection mechanisms. Considering the BibSonomy dataset, spammers add in averageeight tags to a post, while non-spammers add three. The average number of TAS (seedefinition in [16] is about 450 for spammers and 250 for users (tascount).
For the case study, we used data from the BibSonomy system, that is, the dataset pro-vided by last years ECML PKDD discovery challenge.1 The original data set contains31715 cases (instances) in total. After removing instances with missing values, theapplied data set contained 31034 instances. As discussed above, we applied the de-scribed 15 attributes for concept description. The distribution of the classes in the ap-plied dataset is highly unbalanced, with 1812 non-spammers and 29222 spammers fordefault target shares of 5.8% non-spammers, and 94.2% spammers. In the following,we will discuss both classes, i.e., spammers and non-spammers as our target concepts,using both characteristic and discriminative features/subgroups.
For the spammer/non-spammer case study we applied the qT P R quality function
measuring the true positive rate, or the recall of the patterns. For the discriminativesetting we applied the relative gain quality function qRG which is order equivalent toprecision. Finally, for assessing the F-Measure, we utilized the classical recall and pre-cision measures. As outlined above, we used adapted measures including the simplicityof the patterns by favoring patterns with shorter descriptions.
1 http://www.kde.cs.uni-kassel.de/ws/rsdc08/dataset.html
Since there exists a variety of other quality functions, especially for the discimi-
native setting, i.e., for class discrimination, we also experimented with other qualityfunctions, e.g., the weighted relative accuracy (WRACC) funtion [17]: The relative gainmeasure obtains very specific (smaller) subgroups compared, e.g., to WRACC, whichare very discriminative and focus on specific local points of the target space. There-fore, the selection and application of the quality functions – especially those used fordiscrimination – is significantly dependent on the user requirements and the goal pa-rameters of interest that are to be included into the quality function: Since the criteriafor the evaluation and classification of spammers of the case study already consideredmeasures from information retrieval (precision and recall) both turned out to be the per-fect candidates and could be directly included in the subgroup discovery techniques. Inaddition, the simplicity of the patterns was also perceived as a very important parameterfor concept description.
In the following sections we present the results of applying the approach for de-
scribing spammers and non-spammers for concept characterization and discrimination.
Describing Non-Spammers: When comparing the attributes that are used for conceptcharacterization and discrimination for non-spammers, we see that mainly date_diff,grouppub, maildigit, maillen, namedigit, realname2, realname3, realnamelen, tld, andtldcount are used for characterization, while date_diff, domaincount, maillen, namelen,realnamelen, tascount, tasperpost, tasperpost, tld, and tldcount are more discriminative. The used value ranges for the features are not always exclusive, which is explained bythe general observation that characterizing features are also often observed for anotherclass, while this is not true for the discriminative features. In general, profile informationseems more important for characterization, while activity-based features seem moreimportant for discrimination.
Figure 1 shows the results of applying the concept characterization task: It is easy to
see, that the discovered subgroups are relatively large, and (by construction) large areasof the target space are covered by the individual patterns. The most important attributesare grouppub, realname3, and namedigit which characterize the non-spammer classwith a value of zero very well. The latter observation can be confirmed by the fact thatmost spammers have numbers in their usernames while non-spammers focus on shortusernames without numbers.
Figure 2 shows the results of applying the discrimination task: As expected, the
discriminative setting focuses on relatively small sections of the target space with highprecision (target share), in contrast to the concept characterization results. The resultingpatterns are significantly discriminative for the target class (non-spammer) and are read-ily available, e.g., for classification or explanation. From the application point of viewone may observe that the number of tags per post is a very important attribute. The mostdiscriminative values for this attribute are 2 and 3 which better fits the intuition that thenon-spammer adds a smaller number of tags to a resource than a spammer. However, 4and 5 is still a discriminative number and appears again in a few patterns. Another veryimportant attribute is data_diff with a value smaller than 7. Typically, non-spammersseem to register and submit their first post thereafter, while spammers tend to register inBibSonomy and wait until they start to use the service they ’recently’ discovered on the
Fig. 1. Concept Characterization of non-spammers. The table shows top 20 subgroup descriptionsfor the target concept class = non-spammer; Size denotes the subgroup size, Quality is measuredby the characteristic relative gain quality function, p/Precision denotes the target share of thesubgroup (precision), T P the number of true positives in the subgroup, and Recall the recallvalue of the subgroup pattern.
internet. Furthermore, the ’tld=de’ features seems to be very important, which can beexplained by the fact that the system is very popular (for legitimate) users in Germany.
In general, while the concept characterization tasks produces descriptions which
focus on demographic features such as the selection of the username, a distinction be-tween different groups of non-spammers can be made with a combination of demo-graphic and activity features. We can learn from this, that a good indicator for non-spammers is already given in the data they provide when registering; however, in orderto reliable classify spammers we need to add information about their interaction withthe system.
Finally, Figure 3 shows the results of applying the combined F-Measure captur-
ing both concept characterization and discrimination. Since the F-Measure combinesboth characterization and discrimination, the results show a balance between the otherresult tables: The focus of the patterns shifts towards more ’precise’ but also more typ-ical features. Considering the selected attributes, date_diff and tldcount appear mostfrequently. Considering the values of the attribute tasperpost, the attribute is still im-portant; however, it only comprises a smaller number of tas (≤ 2), while the differentsubgroups implied by the condition tas (> 2) seem to form a bad general description.
spammer=nonspamm tasperpost=4-5 AND tldcount=116-123
spammer=nonspamm date_diff=0-7 AND tascount=0-1
spammer=nonspamm date_diff=0-7 AND domaincount=140-168
spammer=nonspamm date_diff=0-7 AND tld=de
spammer=nonspamm date_diff=0-7 AND namelen=0-3
spammer=nonspamm date_diff=0-7 AND domaincount=89-90
spammer=nonspamm tasperpost=2 AND tldcount=1009-1312
spammer=nonspamm tasperpost=0-1 AND tld=de
spammer=nonspamm date_diff=0-7 AND maillen=0-13
spammer=nonspamm realnamelen=0 AND tldcount=116-123
spammer=nonspamm date_diff=0-7 AND tascount=3-5
Fig. 2. Concept Discrimination of non-spammers. The table shows the 20 best subgroup descrip-tions for the target concept class = non-spammer, for the discrimination setting. We applied thequality function qRG, i.e., the relative gain quality function; for a description of the remainingparameters see Figure 1.
Describing Spammers: Considering the attributes used for concept discrimination,we see that specific values of domaincount, grouppub, maildigit, namedigit, namelen,realname2, realnamelen, tasperpost, tldcount, and certain top-level domains (tld) arevery good indicators for spammers. For characterization, attributes like date_diff, do-maincount, grouppub, maildigit, maillen, namedigit, realnameX, tascount, tasperpost,tld and tldcount are important, similar to the discriminative setting, but as expected thevalue sets are often more general than the specific patterns used for discrimination.
Figure 4 shows the results of the characterization of spammers:2 While group-
pub=>0 is a perfect feature for discrimination, grouppub=0 is also a good feature forcharacterization, since there is also a large number of spammers with grouppub=0. As expected, spammers often do not enter very many names (realname3=0), but theytend to have long (namelen=>9) names, and a long email (maillen=>17). Addition-ally, spammers often use digits in their names and email (namedigit, maildigit) sinceit seems that they tend to number their created accounts at different sites. As a furthercharacteristic, they often come from the com domain, and use main tas (tascount=>33)and tas per post as well (tasperpost=5-11).
2 These results are also similar to the F-Measure results since spammers form the majority class
and therefore the recall seems to dominate in this setting. Therefore we don’t provide a detaileddiscussion of the F-Measure results.
spammer=nonspamm grouppub=0 AND tldcount=1009-1312
spammer=nonspamm namedigit=0 AND tldcount=1009-1312
spammer=nonspamm maildigit=0 AND tldcount=1009-1312
spammer=nonspamm realname3=0 AND tldcount=1009-1312
spammer=nonspamm date_diff=0-7 AND grouppub=0
spammer=nonspamm date_diff=0-7 AND namedigit=0
spammer=nonspamm date_diff=0-7 AND realname3=0
spammer=nonspamm date_diff=0-7 AND maildigit=0
spammer=nonspamm maillen=>17 AND tld=de
Fig. 3. Concept Description using the F-Measure. The table shows the top 20 subgroup descrip-tions for the target concept class = non-spammer (combined concept description setting).
Figure 5 shows the results of the discriminative description of spammers: While
date_diff is not so important for discriminating spammers than discriminating non-spammers, the tasperpost attribute is also very important. As expected, and as alsoshown by the characterization findings, realnamelen, maildigit, namedigit provide typi-cal features for spammers – usually having digits in their names, and using longer namesin general. Another very discriminative feature is tld. Spammer seems to heavily relyon domains like: th, us, info,and biz in addition to the already mentioned com domain. This complements the patterns observed for the non-spammers.
This paper is especially interesting with respect to two research areas: From a theo-retical point of view, we propose a novel approach for describing concepts based onsubgroup discovery methods. Practically, we apply our method to the application ofspam detection in social bookmarking systems. In this section we will discuss relatedwork for both parts.
Several methods for concept description have been investigated in the past: For
example, attribute-oriented induction techniques, e.g., [1] generate a set of generalizedrelations for obtaining a summary of the task-relevant data. Attribute-oriented inductionfocuses on a specific concept of interest, relies on the specification of the relevant set ofattributes and applies concept hierarchies for concept generalization.
spammer=realname3=0 AND tldcount=>15092
Fig. 4. Concept Characterization of spammers. The table shows 20 best subgroup descriptions forthe target concept class = spammer; as for the non-spammer characterization, we applied the truepositive rate qT P R quality function; for a description of the parameters see Figure 1.
Furthermore, association rule mining [9] can also be regarded as a general con-
cept description task. However, in contrast to subgroup discovery and attribute-orientedinduction, association rule algorithms do not focus on a specific target concept. In rela-tion to to mining association rules, the subgroup discovery task is focused on a specificconcept of interest. Therefore, less (non-interesting) results are generated, and efficientalgorithms can be applied that utilize the concept of interest for pruning the search space[8]. Furthermore, the proposed approach condenses the discovered set of subgroups intoa set of relevant subgroups that still convey the same information as the original set.
Compared to attribute-oriented induction subgroup discovery can especially be ap-
plied for a large number of independent variables/attributes of interest; it can integratebackground knowledge [18] for decreasing the search space and to focus the searchprocess. Thus, it can support much larger search spaces. The discovery process is goal-directed and can be flexibly configured using an arbitrary quality function: Then, duringthe description task, the quality function is directly applied for guiding the search (gen-eralization) process, in contrast to attribute-oriented techniques.
Mining contrast sets, e.g., [19], focuses on discovering conjunctions of attribute-
value pairs that differ meaningfully between groups (concepts) concerning their distri-butions. In that sense, subgroup discovery for binary target concepts can be regarded asa special case of constrast set mining, however, Kralj et al. have inductively shown thatsubgroup discovery and contrast set mining are actually compatible approaches. There-
spammer=spamdomaincount=169-2173 AND realnamelen=1-3
spammer=spamnamelen=5 AND realnamelen=1-3
spammer=spamrealname2=>0 AND realnamelen=1-3
spammer=spamrealnamelen=1-3 AND tasperpost=>11
spammer=spamrealnamelen=13-15 AND tld=biz
spammer=spamtasperpost=>11 AND tld=biz
Fig. 5. Concept Discrimination of spammers. The table shows the top 20 subgroup descriptionsfor the target concept class = spammer, for the discrimination setting. We applied the qualityfunction qRG, i.e., the relative gain quality function; for a description of the remaining parameterssee Figure 1.
fore, the presented pattern discovery task provides for more options: It is more flexibleconcerning the application requirements, weighting recall and precision with respect tothe characterization and discrimination tasks.
Research on spam detection in social media has been conducted by the blog and
wikipedia community. Methods to detect comment spam and spam blogs have beenproposed by [20–22]. [23, 24] are the first to deal with spam in tagging systems ex-plicitly. The authors identify anti-spam strategies for tagging systems and construct andevaluate models for different tagging behaviour. In contrast to [23, 24] we present a con-crete study using data mining techniques to get more insights on a real-world dataset. [15] focuses on the extraction of features suitable to predict spam behavior in socialbookmarking systems by utilizing machine learning approaches.
In this paper, we have presented an approach for concept characterization and discrimi-nation using local patterns that are discovered using subgroup discovery techniques. Wehave shown, how to apply suitable quality functions for the characterization and dis-crimination tasks. Additionally, we presented an option for weighting these sub-goalsin order to provide a combined option for a comprehensive view on the the characteri-zation and discrimination aspects. These can then be analyzed according to the specificrequirements of the application.
We described a case study using real-world data from the BibSonomy system that
provided for a very versatile testbed. The presented approaches reveal interesting in-sights about the specific characteristics of the BibSonomy (non)spammers. For exam-ple, the number of tags per post or the time lag between registration and the first posthelp to distinguish spammers from (non)spammers.
For future work, we plan to extend the case study by comparing different profiles
that were involved in tagging spammers. Then, we can easily perform an assessmentof the inter-annotator agreement of these users. A next step is the extension of thediscovery technique for community mining in social networks by considering the spe-cial (triadic) structure of the user-tag-resource space. Additionally, we plan to integrateapproaches for optimizing the set of patterns, e.g., similar to pattern teams [25]. An-other interesting option for future work is given by considering extended measures forobjective and subjective interestingness criteria besides a potentially more refined sim-plicity/complexity measure.
This work has been partially supported by the German Research Council (DFG) undergrant Pu 129/8-2 and grant GZ STU 252/5-1.
1. Han, J., Fu, Y.: Exploration of the Power of Attribute-Oriented Induction in Data Mining. In
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., eds.: Advances in Knowl-edge Discovery and Data Mining. AIII Press/MIT Press (1996)
2. Klösgen, W.: Explora: A Multipattern and Multistrategy Discovery Assistant. In Fayyad,
U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., eds.: Advances in Knowledge Dis-covery and Data Mining. AAAI Press (1996) 249–271
3. Wrobel, S.: An Algorithm for Multi-Relational Discovery of Subgroups. In: Proc. 1st Eu-
ropean Symposium on Principles of Data Mining and Knowledge Discovery (PKDD-97),Berlin, Springer Verlag (1997) 78–87
4. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd Edition. Morgan Kauf-
mann Publishers Inc., San Francisco, CA, USA (2006)
5. Lemmerich, F., Atzmueller, M.: Incorporating Exceptions: Efficient Mining of -Relevant
Subgroup Patterns. Proc. LeGo-09 Workshop at ECML/PKDD 2009 (2009)
6. Atzmueller, M., Puppe, F.: Semi-automatic refinement and assessment of subgroup pat-
terns. In: Proc. 21th International Florida Artificial Intelligence Research Society Conference(FLAIRS-2008). (2008) 323–328
7. Hotho, A., Jäschke, R., Schmitz, C., Stumme, G.: BibSonomy: A social bookmark and
publication sharing system. In: CS-TIW ’06, Aalborg, Denmark, Aalborg University Press(July 2006)
8. Atzmueller, M., Puppe, F.: SD-Map – A Fast Algorithm for Exhaustive Subgroup Discovery.
In: Proc. 10th European Conference on Principles and Practice of Knowledge Discovery inDatabases (PKDD 2006). Number 4213 in LNAI, Berlin, Springer Verlag (2006) 6–17
9. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In Bocca, J.B.,
Jarke, M., Zaniolo, C., eds.: Proc. 20th Int. Conf. Very Large Data Bases, (VLDB), MorganKaufmann (1994) 487–499
10. Garriga, G.C., Kralj, P., Lavraˇc, N.: Closed Sets for Labeled Data. Journal of Machine
Relevancy in Constraint-based Subgroup Discovery.
Jean-Francois Boulicaut, Luc de Raedt, H.M., ed.: Constraint-based Mining and InductiveDatabases. Volume 3848 of LNCS. Springer Verlag, Berlin (To appear, 2006)
12. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge
13. Atzmueller, M., Baumeister, J., Puppe, F.: Quality Measures and Semi-Automatic Mining of
Diagnostic Rule Bases. In: Proc. 15th Intl. Conference on Applications of Declarative Pro-gramming and Knowledge Management (INAP). Volume 3392 of LNAI., Berlin, SpringerVerlag (2005) 65–78
14. Koenig, R., Johansson, U., Niklasson, L.: Using Genetic Programming to Increase Rule
Quality. In: Proc. 21st Intl. Florida Artificial Intelligence Research Society Conference(FLAIRS), AAAI Press (2008) 288–293
15. Krause, B., Schmitz, C., Hotho, A., Stumme, G.: The anti-social tagger - detecting spam in
social bookmarking systems. In: Proc. AIRWeb ’08. (2008)
16. Hotho, A., Jäschke, R., Schmitz, C., Stumme, G.: Information retrieval in folksonomies:
Search and ranking. In: Proc. ESWC ’06, Budva, Montenegro, Springer (June 2006) 411–426
17. Lavrac, N., Kavsek, B., Flach, P., Todorovski, L.: Subgroup Discovery with CN2-SD. Journal
of Machine Learning Research 5 (2004) 153–188
18. Atzmueller, M., Puppe, F., Buscher, H.P.:
Knowledge-Intensive Subgroup Discovery. In: Proc. 19th Intl. Joint Conference on Arti-ficial Intelligence (IJCAI-05), Edinburgh, Scotland (2005) 647–652
19. Kralj, P., Lavrac, N., Webb, G.I.: Supervised Descriptive Rule Discovery: A Unifying Survey
of Contrast Set and Emerging Pattern and Subgroup Mining. Journal of Machine LearningResearch 10 (2009) 377–403
20. Kolari, P., Finin, T., Joshi, A.: SVMs for the Blogosphere: Blog Identification and Splog
Detection. AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs(2006)
21. Kolari, P., Java, A., Finin, T., Oates, T., Joshi, A.: Detecting Spam Blogs: A Machine Learn-
22. Mishne, G., Carmel, D., Lempel, R.: Blocking blog spam with language model disagreement.
In: Proc. AIRWeb ’05, New York, NY, USA, ACM (2005) 1–6
23. Heymann, P., Koutrika, G., Garcia-Molina, H.: Fighting spam on social web sites: A survey
of approaches and future challenges. IEEE Internet Computing 11(6) (2007) 36–45
24. Koutrika, G., Effendi, F.A., Gyöngyi, Z., Heymann, P., Garcia-Molina, H.: Combating spam
in tagging systems. In: Proc. AIRWeb ’07, New York, NY, USA, ACM (2007) 57–64
25. Knobbe, A.J., Ho, E.K.Y.: Pattern Teams. In: Knowledge Discovery in Databases: Proc.
PKDD 2006, 10th European Conference on Principles and Practice of Knowledge Discoveryin Databases. Volume 4213 of LNCS., Springer (2006) 577–584
Produced by Dr. David Voss, Specialist Renal Physician in the interest of public health education. Diabetes mellitus and the kidney Information Sheet What is diabetes mellitus? Diabetes mellitus (often referred to just as diabetes) is a lack of insulin. As a result of the lack of insulin, glucose is not handled properly in the human body, and high blood sugar (glucose) levels result. The high
NADIS Pig Health - October 2006 Rectal Prolapse Compared to other farm species, the pig appears to beparticularly vulnerable to prolapse of the rectal tissue through theanus, which can be seen in any age group from as early as 1-2days old up to adults.The fundamental cause of the prolapse is anincrease in abdominal pressure, forcing a breakdown in the weakmuscular support mechanism o