ASSOCIATION RULE- A TOOL FOR DATA MINING


By
Praveen Ranjan Srivastava
Lecturer (Computer Science)
Banasthali Vidyapith (Deemed University)
Banasthali-304002 (Rajasthan)
Phone : 01438-228787
E-mail : spraveen@banasthali.ac.in / pra_ranjan@rediffmail.com
 


ABSTRACT:

Data mining is the process of extracting knowledge from large amount of databases whether database is relational, temporal, spatial, multimedia etc. Data mining is very popular technology some companies like IBM, ORACLE, and TCS are working on data mining. Data mining engine having different kinds of approach like classification, clustering, association, outliers analysis. Each approach has a different technology like classification based on decision tree induction whether association based on interesting pattern generation. Some people are saying data mining is the part of knowledge discovery in databases (KDD). We can say Data mining tools perform data analysis and may uncover important data patterns. This paper has a full attention on association rule. Association rule is very important tool for mining process it has two special chartersticis first one is support and another is confidence. Support gives total number of transaction of any particular item are occurring in datasets while confidence gives strength of a data in a dataset, we can say support is probability of A and B while confidence is conditional probability. Association rule based on these two characteristics. Different algorithms support association rule but this paper show only two very popular approach Apriori and FP-tree method. Both approach gives frequent patterns, and candidate generation; frequent pattern means those item sets that satisfy minimum support value. We can say in one-word association rule gives an "interesting pattern"; this paper give and tries how association rules generate interesting patterns form a huge amount of databases.

INTRODUCTION:

The major reason that data mining has attracted a great deal of attention in the information industry in recent years is due to the wide availability of huge amount of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for application ranging from business management, production control, and market analysis, to engineering design and science exploration. Data mining can be viewed as a result of the natural evolution of information technology. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining used different types of engine such as classification, clustering, association, and outliers .In classification class label is known whether in clustering class label is unknown, association gives to interesting pattern of the data while outliers give fraud detections.

What is Data Mining?

Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies, knowledgebase's, and scientific and medical research. The widening gap between data and information calls for a systematic development of data mining tools. So simply you can say data mining refers to extracting or "mining" knowledge from large amount of data. 

KDD (Knowledge Discovery in Databases): Many people treat data mining as a synonym for another popularly used term, i.e. KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery in databases. KDD process is depicted in figure. 
 


1. Data Cleaning: To remove noise and inconsistent data.

2. Data integration: Where multiple data source may be combined.

3. Data Selection: Where data relevant to the analysis task are retrieved from the database.

4. Data Transformation: Where data are transformed

5. Data mining: An essential process where intelligent methods are applied in order to extract data pattern.

6. Pattern Evaluation: To identify the interesting pattern.

7. Knowledge presentation: Where visualization and knowledge representation techniques are used.


Architecture of a data mining system: The Typical data mining architecture may have following major component.
 


The data mining components are

  • Database, data warehouse, or other information repository
  • Database, data warehouse server
  • Data mining engine
  • Patterns evaluation model
  • Graphical user interface

What is association analysis? Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. Association analysis is widely used for market basket or transition data analysis.

Association rule having two main important properties.

  • Support
  • Confidence

The definition of the support and confidence is

                                        Support (AB) = P (AUB)

                                      Confidence (AB) = P (B|A).

If we correlate support and confidence then

                Confidence ( AB) = P (B|A) = Support_ count (AUB)/Support_ count (A)

Where Support_ count (AUB) is the number of transaction containing the item sets AU B, and Support_ count (A) is the number of transactions containing the item set A.

"How are association rules mined from large databases?" Association rule mining is a two –step process:

    1. Find all frequent item sets: By definition, each of these item sets will occur at least as frequently as a predetermined minimum support count.

    2. Generate strong association rules from the frequent item sets: By definition, these rules must satisfy minimum support and minimum support and minimum confidence.

APRIORI ALGORITHM: (Finding Frequent Item sets Using Candidate Generation)

Apriori is an influential algorithm for mining frequent item sets. The name of the algorithms is based on the fact that the algorithm uses prior knowledge of frequent item sets properties. Apriori employs an iterative approach known as a level-wise search.

To improve the efficiency of the level-wise generation of frequent item sets, an important property called the apriori property, i.e." all nonempty subsets of a frequent item sets must also be frequent."

Apriori algorithms having a two-step process.

    • The join step: To find  Lk , a set of candidate k item sets is generated by joining Lk-1  with itself. This set of candidate is denoted Ck.
    • The prune step:  Ck is the superset of Lk, that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A  scan of the databases to determine the count of each candidate in Ck would result in the determination of Lk. (i.e. all candidates having a count no less than the minimum support count are frequent by definition, and therefore belongs to Lk

procedure AprioriAlg()
begin

    L1 := {frequent 1-itemsets};
    for ( k := 2; Lk-1 0; k++ ) do {
        Ck= apriori-gen(Lk-1) ; // new candidates
    for all transactions t in the dataset do {
       for all candidates c Ck contained in t do
                 c:count++
       }
       Lk = { c Ck | c:count >= min-support}
    }
    Answer := k Lk

end

It makes multiple passes over the database. In the first pass, the algorithm simply counts item occurrences to determine the frequent 1-itemsets (itemsets with 1 item). A subsequent pass, say pass k, consists of two phases. First, the frequent itemsets Lk-1 (the set of all frequent (k-1)-itemsets) found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the apriori-gen() function. This function first joins Lk-1 with Lk-1, the joining condition being that the lexicographically ordered first k-2 items are the same. Next, it deletes all those itemsets from the join result that have some (k-1)-subset that is not in Lk-1 yielding Ck.

The algorithm now scans the database. For each transaction, it determines which of the candidates in Ck are contained in the transaction using a hash-tree data structure and increments the count of those candidates. At the end of the pass, Ck is examined to determine which of the candidates are frequent, yielding Lk . The algorithm terminates when Lk becomes empty.

FP-TREE GROWTH ALGORITHM:

Apriori algorithms suffer from the following two shortcomings:

    1. It is costly to handle large numbers of candidate sets. For instance, 104 frequent 1-itemsets, then approximately, 107   candidate 2-itemsets are generated.

    2. It is tedious to repeatedly scan the database and check a large set of candidates by pattern matching.

Keeping this in mind, a new class of algorithms has recently been proposed which avoids the generation of large numbers of candidate sets. We describe one such method, called the FP-tree growth algorithm. It is proposed by Han et al. The main idea of the algorithm is to maintain a frequent pattern tree of the databases.

A frequent pattern tree (or FP-tree) is a tree structure consisting of an item-prefix-tree and a frequent item-header table.

  • Item- prefix-tree:
    • o It consists of a root node labeled null

      o Each non-root node consists of three fields:

        * Item name

        * Support count

        * Node link.

         

  • Frequent-item –header-table: It consists of two fields;
      • * Item name

        * Head of node link which points to the first node in the FP-tree

        * Carrying the item name.

Association rules should not be used directly for prediction without further analysis or domain knowledge. They do not   necessarily indicate causation. They are however a helpful starting point for further exploration, making them a popular tool for understanding data.

CONCLUSION:

The discovery of association relationship among huge amount of data is useful in selective marketing, decision analysis, and business management.  A popular area of application is market basket analysis, which studies the buying habits of customers by searching for sets of item that are frequently purchased together. Association rule mining consists of first finding frequent item sets, from which strong association rules in the form of A B are generated. Association rule is the important tool for data mining engine. It is very popular technology now days. This paper have used only two approaches for association rule for mining the process named apriori and fp-tree. These two algorithms are very popular for mining rule. Last but not least we can say association rule gives interesting pattern to our customers. 

References:

 


Praveen ranjan srivastava
Lecturer (Computer Science)
Banasthali Vidyapith (Deemed University)
Banasthali-304002 (Rajasthan)
Phone : 01438-228787
Email : spraveen@banasthali.ac.in / pra_ranjan@rediffmail.com
 

Source : E-mail March 20, 2004

 

B A C K

 

Important Note :
Site Best Viewed in Internet
Explorer in 1024x768 pixels
Browser text size: Medium