
Wednesday, September 26, 2012

DATA WAREHOUSING AND MINING ENGINEERING LECTURE NOTES--Constraint-Based Association Mining

Constraint-Based Association Mining

Constraint-based rule miners find all rules in a given dataset meeting user-specified constraints such as minimum support and confidence. We describe a new algorithm that directly exploits all user-specified constraints including minimum support, minimum confidence, and a new constraint that ensures every mined rule offers a predictive advantage over any of its simplifications. Our algorithm maintains efficiency even at low supports on data that is dense (e.g. relational data). Previous approaches such as Apriori and its variants exploit only the minimum support constraint, and as a result are ineffective on dense data due to a combinatorial explosion of “frequent item sets”.
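To make these two baseline constraints concrete, here is a minimal sketch (illustrative only, not the algorithm described in these notes) that computes the support and confidence of a candidate rule over a toy set of market-basket transactions; the data, item names, and thresholds are made up for illustration.

```python
# Minimal sketch of the two classic rule constraints: support and
# confidence of a rule A -> c over a list of transactions (each a set
# of items). Data and values below are illustrative, not from the notes.

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    sup_a = support(antecedent, transactions)
    if sup_a == 0:
        return 0.0
    return support(antecedent | {consequent}, transactions) / sup_a

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"cereal", "milk"},
    {"milk"},
]
A, c = {"bread", "butter"}, "milk"
sup = support(A | {c}, transactions)
conf = confidence(A, c, transactions)
print(f"support={sup:.2f}, confidence={conf:.2f}")  # support=0.40, confidence=0.67
```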

 

Mining rules from data is a problem that has attracted considerable interest, because a rule provides a concise statement of potentially useful information that is easily understood by end users. In the database literature, the focus has been on developing association rule algorithms that identify all conjunctive rules meeting user-specified constraints such as minimum support (a statement of generality) and minimum confidence (a statement of predictive ability). These algorithms were initially developed to tackle data-sets primarily from the domain of market-basket analysis, though there has been recent interest in applying them to other domains, including telecommunications data analysis, census data analysis, and classification and predictive modeling tasks in general. Unlike data from market-basket analysis, data-sets from these other domains tend to be dense, in that they have any or all of the following properties:

 

• Many frequently occurring items (e.g. sex=male);

• Strong correlations between several items;

• Many items in each record.

 

These data-sets can cause an exponential blow-up in the resource consumption of standard association rule mining algorithms, including Apriori and its many variants. The combinatorial explosion results from the fact that these algorithms effectively mine all rules satisfying only the minimum support constraint, the number of which is exorbitant. Though other rule constraints are specifiable, they are typically enforced only during a post-processing filter step. Our approach to mining dense data-sets is instead to enforce all user-specified rule constraints directly during mining. For example, most association rule miners allow users to set a minimum on the predictive ability of any mined rule, specified either as a minimum confidence or as a minimum on an alternative measure such as lift (also known as interest) or conviction. Our algorithm can exploit such minimums on predictive ability during mining for vastly improved efficiency.

Even given strong minimums on support and predictive ability, the rules satisfying these constraints in a dense dataset are often too numerous to be mined efficiently or comprehended by the end user. To remedy this problem, our algorithm exploits another constraint that eliminates rules that are uninteresting because they contain conditions that do not (strongly) contribute to the predictive ability of the rule. To illustrate this concept, first consider the rule below:

 

Bread & Butter → Milk (Confidence = 80%)

 

This rule has a confidence of 80%, which says that 80% of the people who purchase bread and butter also purchase the item in the consequent of the rule, which is milk. Because of its high confidence, one might be inclined to believe that this rule is an interesting finding if the goal is to, say, understand the population of likely milk buyers in order to make better stocking and discounting decisions. However, if 85% of the population under examination purchased milk, this rule is actually quite uninteresting for this purpose, since it characterizes a population that is even less likely to buy milk than the average shopper. This point has motivated additional measures for identifying interesting rules, including lift and conviction. Both lift and conviction represent the predictive advantage a rule offers over simply guessing based on the frequency of the consequent. But both measures still exhibit another closely related problem, illustrated by the rule that follows the sketch below.
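As a rough sketch of these two measures, using their standard definitions from the association rule literature (the numbers below simply mirror the bread-and-butter example above):

```python
# Lift and conviction of a rule A -> c, given the rule's confidence and
# the baseline frequency of the consequent. Values at or below 1
# indicate no predictive advantage over guessing the consequent.

def lift(conf, freq_c):
    return conf / freq_c

def conviction(conf, freq_c):
    if conf == 1.0:
        return float("inf")  # a rule that never fails
    return (1 - freq_c) / (1 - conf)

# Bread & Butter -> Milk: confidence 0.80, while P(milk) = 0.85.
print(lift(0.80, 0.85))        # ~0.94 (< 1: worse than the baseline)
print(conviction(0.80, 0.85))  # 0.75  (< 1: same verdict)
```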

 

Eggs & Cereal → Milk (Confidence = 95%)

 

Because the confidence of this rule (95%) is significantly higher than the frequency with which milk is purchased (85%), this rule will have lift and conviction values that could imply to the end user that it is interesting for understanding likely milk buyers. But suppose the purchase of cereal alone implies that milk is purchased with 99% confidence. The above rule then actually represents a decrease in predictive ability relative to its proper sub-rule Cereal → Milk, which is also more concise and more broadly applicable (because there are more people who buy cereal than people who buy both cereal and eggs). To address these problems, our algorithm allows the user to specify a minimum improvement constraint: mine only those rules whose confidence is at least minimp greater than the confidence of any of their proper sub-rules, where a proper sub-rule is a simplification of the rule formed by removing one or more conditions from its antecedent. Any positive setting of minimp would prevent the undesirable rules in the examples above from being generated by our algorithm.

More generally, the minimum improvement constraint remedies the rule explosion problem arising because, in dense data-sets, the confidence of many rules can be marginally improved in an overwhelming number of ways by adding conditions. For example, given the rule stating that cereal implies milk with 99% confidence, there may be hundreds of rules of the form below with a confidence between 99% and 99.1%.

 

Cereal & i_1 & i_2 & ... & i_n → Milk

 

By specifying a small positive value for minimp, one can trade away such marginal benefits in predictive ability for a far more concise set of rules, with the added property that every returned rule consists entirely of items that are strong contributors to its predictive ability. We feel this is a worthwhile trade-off in most situations where the mined rules are used for end-user understanding.
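To make the definition concrete, here is a minimal post-hoc filter sketch of the minimum improvement constraint. The algorithm described in these notes enforces this constraint during mining rather than as a filter, and the helper names and data below are assumptions for illustration.

```python
# Improvement of a rule A -> c: its confidence minus the best
# confidence among its proper sub-rules (same consequent, any strict
# subset of the antecedent, including the empty antecedent, whose
# confidence is just the baseline frequency of c). Post-hoc sketch
# only; assumes a non-empty antecedent A.

from itertools import combinations

def conf(A, c, transactions):
    covered = [t for t in transactions if set(A) <= t]
    return sum(1 for t in covered if c in t) / len(covered) if covered else 0.0

def improvement(A, c, transactions):
    best_sub = max(conf(sub, c, transactions)
                   for k in range(len(A))        # k < |A|: proper subsets
                   for sub in combinations(A, k))
    return conf(A, c, transactions) - best_sub

def passes_minimp(A, c, transactions, minimp=0.01):
    return improvement(A, c, transactions) >= minimp

# Toy data echoing the example: Eggs & Cereal -> Milk at 95%, while
# Cereal -> Milk alone is 98%, so the longer rule is pruned.
tx = ([{"eggs", "cereal", "milk"}] * 19 + [{"eggs", "cereal"}]
      + [{"cereal", "milk"}] * 79 + [{"cereal"}])
print(improvement(("eggs", "cereal"), "milk", tx))   # -0.03: a net loss
print(passes_minimp(("eggs", "cereal"), "milk", tx)) # False: pruned
```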
