Decision Support Systems in Health Care – Velocity of Apriori Algorithm

Mario SOMEK^a,1 and Mira HERCIGONJA-SZEKERES^a

^a University of Applied Health Sciences, Zagreb, Croatia

Abstract. The amount of data stored in health information systems can reach terabytes and petabytes, and applying data mining algorithms makes it possible to find useful information for making quality business decisions. A frequently used method for determining rules that relate attributes to one another is association rule mining with the Apriori algorithm. The main drawback of the basic Apriori algorithm is its slow performance, caused by repeated scans of the data set.

A comparison of rule generation speed for the basic and the improved Apriori algorithm, carried out with the RapidMiner software, confirmed that the improved algorithm needs less time to generate rules, particularly for large data sets, which is an advantage for decision making.

Keywords. Decision support, generate rules, Apriori algorithm, Health care

1. Introduction

Ambulances, laboratories, outpatient clinics and expert opinions in the field of health care are sources of information whose volume grows exponentially as data are collected and treatments are carried out. Correct structuring of data and the use of various analysis techniques and related algorithms support better business decisions, and the credibility of the obtained rules is proportional to the amount of data.

Data mining (DAP) [1], also known as knowledge discovery in data sets, is the complex extraction of potentially useful and previously unknown information from data stored in databases. Association rules [2] allow rules to be generated from the relationships among observed attributes, depending on their values. The basic Apriori algorithm (AA, Apriori) used for this purpose has a drawback: it scans the data repeatedly and is therefore slow. Applying advanced and improved analysis techniques enhances business processes, especially when working with the large data sets that are common in healthcare.

This paper presents a version of the improved AA and analyses several health care data sets with the RapidMiner software to examine how the size of the data affects the speed of rule generation for the basic and the improved AA.

1 Corresponding author, M. Somek, University of Applied Health Sciences, Zagreb, Croatia; E-mail: mariosomek@gmail.com

The Practice of Patient Centered Care: Empowering and Engaging Patients in the Digital Era R. Engelbrecht et al. (Eds.)

This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).

doi:10.3233/978-1-61499-824-2-53

2. Disadvantages of basic Apriori

Since the AA was introduced (R. Agrawal, 1994), deficiencies that reduce its efficiency have been observed [4], and attempts have been made to eliminate them by developing new, advanced algorithms based on the basic one: AprioriTid and AprioriHybrid [5].

In their papers, many authors describe advanced Apriori algorithms and point out the shortcomings of the basic one. X. Fang [4] points out two main disadvantages: scanning large databases reduces efficiency because of the large volume of input data that must be loaded, and when re-scanning the data set the algorithm does not use the results of the previous scan. S. Rao and P. Gupta [6] state that if the first scan yields 10⁴ frequent attributes, Apriori generates on the order of 10⁷ candidate attribute pairs that must be stored and tested for frequency of occurrence, which requires huge resources and multiple scans of the data set.
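The order of magnitude quoted from [6] can be checked by counting unordered pairs. The short Python sketch below takes 10⁴ as the number of frequent single attributes (an interpretation of the figure, used here only for illustration); the variable names are hypothetical.

```python
from math import comb

# Assumption: 10**4 is the number of frequent single attributes after the first scan.
frequent_attributes = 10**4

# Every unordered pair of frequent attributes becomes a candidate to be counted.
candidate_pairs = comb(frequent_attributes, 2)

print(candidate_pairs)  # 49995000, i.e. more than 10**7
```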

3. Association rule

Association rules, one of the DAP methods, determine rules that associate individual attributes, given two parameters: support, the lowest admissible frequency of occurrence of an attribute in the data set, and confidence, the ratio of the number of records that contain both attributes to the number of records that contain the one attribute defined as the condition.
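The two parameters can be illustrated with a minimal Python sketch. It uses the ten records of the example in Section 4 (Table 1) and treats support as an absolute record count, as that example does; the function and variable names are hypothetical.

```python
# Records from Table 1: Id_Z -> set of attributes.
TRANSACTIONS = {
    "Z0": {"A1", "A2", "A3"},
    "Z1": {"A1", "A4", "A5"},
    "Z2": {"A1", "A2", "A3", "A5"},
    "Z3": {"A3", "A5"},
    "Z4": {"A1", "A2", "A3", "A4", "A5"},
    "Z5": {"A1", "A3"},
    "Z6": {"A1", "A2", "A4"},
    "Z7": {"A2", "A3"},
    "Z8": {"A1", "A2", "A3"},
    "Z9": {"A3", "A5"},
}


def support(itemset, transactions):
    """Number of records that contain every attribute of the itemset."""
    return sum(1 for attrs in transactions.values() if itemset <= attrs)


def confidence(condition, consequence, transactions):
    """Share of records containing the condition that also contain the consequence."""
    both = support(condition | consequence, transactions)
    return both / support(condition, transactions)


pair_support = support({"A1", "A2"}, TRANSACTIONS)          # 5 records
rule_confidence = confidence({"A2"}, {"A1"}, TRANSACTIONS)  # 5 / 6
print(pair_support, round(rule_confidence, 3))
```

Here the rule "A2 implies A1" holds in 5 of the 6 records that contain A2.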

The algorithm applied is Apriori [3], and its data scanning process consists of unification (join) and trimming (prune) stages. To find rules, the algorithm finds frequent items in the database by making multiple passes through the data set. It first counts the occurrences of single items, trims those below the support parameter and unifies the remaining items into larger candidates. It then searches for occurrences of two items (pairs), three items, and so on, always trimming according to the support parameter.
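The level-wise search can be sketched in Python as follows. This is a textbook-style illustration under stated assumptions, not the implementation used in RapidMiner: support is an absolute count, the data are the ten records of Table 1 in Section 4, and all names are hypothetical.

```python
from itertools import combinations

TRANSACTIONS = {
    "Z0": {"A1", "A2", "A3"}, "Z1": {"A1", "A4", "A5"},
    "Z2": {"A1", "A2", "A3", "A5"}, "Z3": {"A3", "A5"},
    "Z4": {"A1", "A2", "A3", "A4", "A5"}, "Z5": {"A1", "A3"},
    "Z6": {"A1", "A2", "A4"}, "Z7": {"A2", "A3"},
    "Z8": {"A1", "A2", "A3"}, "Z9": {"A3", "A5"},
}


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    records = list(transactions.values())
    items = sorted({a for attrs in records for a in attrs})
    candidates = [frozenset([a]) for a in items]
    frequent = {}
    k = 1
    while candidates:
        # One full pass over the data set per level (the costly part).
        counts = {c: sum(1 for attrs in records if c <= attrs) for c in candidates}
        # Trimming: drop candidates below the minimum support.
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Unification: merge frequent k-itemsets into (k+1)-item candidates,
        # keeping only those whose k-item subsets are all frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


result = apriori(TRANSACTIONS, min_support=4)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(",".join(sorted(itemset)), count)
```

Note that this variant also uses the frequent pairs when trimming candidate triplets, so on these data it counts only the triplet (A1, A2, A3).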

4. Improved Apriori

The improved version of Apriori is described using an example [7] that contains ten data records (Z0…Z9), each with two to five attributes (A1…A5). Each record is identified by the primary key Id_Z (Z0…Z9). The input parameter is a minimum support of 4.

Table 1. The initial set of data.

Id_Z   Attribute
Z0     A1,A2,A3
Z1     A1,A4,A5
Z2     A1,A2,A3,A5
Z3     A3,A5
Z4     A1,A2,A3,A4,A5
Z5     A1,A3
Z6     A1,A2,A4
Z7     A2,A3
Z8     A1,A2,A3
Z9     A3,A5

M. Somek and M. Hercigonja-Szekeres / Decision Support Systems in Health Care 54

In the initial scan, the algorithm counts the occurrences of each attribute, and in the following step eliminates those that do not meet the minimum support. In the ten records, attribute A4 occurs three times and so does not meet the minimum support requirement. Table 2 shows all attributes with their support and the records in which they appear.

Table 2. Extracted individual attributes (1-element).

Attribute Support Id_Z
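The values for the single-attribute step follow directly from Table 1. The Python sketch below derives them by building, for each attribute, the list of records in which it appears; the names are hypothetical and the output is computed from Table 1, not copied from the original table.

```python
TRANSACTIONS = {
    "Z0": ["A1", "A2", "A3"], "Z1": ["A1", "A4", "A5"],
    "Z2": ["A1", "A2", "A3", "A5"], "Z3": ["A3", "A5"],
    "Z4": ["A1", "A2", "A3", "A4", "A5"], "Z5": ["A1", "A3"],
    "Z6": ["A1", "A2", "A4"], "Z7": ["A2", "A3"],
    "Z8": ["A1", "A2", "A3"], "Z9": ["A3", "A5"],
}
MIN_SUPPORT = 4


def build_index(transactions):
    """Map each attribute to the list of Id_Z values of the records containing it."""
    index = {}
    for record_id, attrs in transactions.items():
        for attr in attrs:
            index.setdefault(attr, []).append(record_id)
    return index


index = build_index(TRANSACTIONS)
for attr in sorted(index):
    print(attr, len(index[attr]), ",".join(index[attr]))
# A1 7 Z0,Z1,Z2,Z4,Z5,Z6,Z8
# A2 6 Z0,Z2,Z4,Z6,Z7,Z8
# A3 8 Z0,Z2,Z3,Z4,Z5,Z7,Z8,Z9
# A4 3 Z1,Z4,Z6
# A5 5 Z1,Z2,Z3,Z4,Z9

# Attributes that survive the minimum support check.
frequent = sorted(a for a, ids in index.items() if len(ids) >= MIN_SUPPORT)
print(frequent)  # ['A1', 'A2', 'A3', 'A5']
```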

In the next step, shown in Table 3, all attribute pairs are formed and their occurrences are counted. Here the basic AA scans every record, while the improved AA scans only the records of the attribute in the pair with the lowest support, recorded as the additional Min parameter. For the attribute pair (A1, A2), Table 2 shows that A2 has lower support than A1, so A2 is the Min attribute and only the records in which A2 appears are scanned. Attribute pairs that do not meet the minimum support (4) are then eliminated.

Table 3. Extracted pairs of attributes (2-elements).

Attribute (pairs)   Support   Min.   Id_Z
A1,A2               5         A2     Z0,Z2,Z4,Z6,Z7,Z8
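The row for the pair (A1, A2) can be reproduced with a short Python sketch of the improved counting step. It assumes the per-attribute record lists from the first scan are kept in memory; the names are hypothetical and this is an illustration of the idea in [7], not a reference implementation.

```python
TRANSACTIONS = {
    "Z0": {"A1", "A2", "A3"}, "Z1": {"A1", "A4", "A5"},
    "Z2": {"A1", "A2", "A3", "A5"}, "Z3": {"A3", "A5"},
    "Z4": {"A1", "A2", "A3", "A4", "A5"}, "Z5": {"A1", "A3"},
    "Z6": {"A1", "A2", "A4"}, "Z7": {"A2", "A3"},
    "Z8": {"A1", "A2", "A3"}, "Z9": {"A3", "A5"},
}

# Record lists per attribute, as collected during the first scan.
INDEX = {}
for record_id, attrs in TRANSACTIONS.items():
    for attr in sorted(attrs):
        INDEX.setdefault(attr, []).append(record_id)


def improved_support(itemset, index, transactions):
    """Count support by scanning only the records of the attribute with the lowest support."""
    min_attr = min(sorted(itemset), key=lambda a: len(index[a]))
    scanned = index[min_attr]
    count = sum(1 for rid in scanned if itemset <= transactions[rid])
    return count, min_attr, scanned


count, min_attr, scanned = improved_support({"A1", "A2"}, INDEX, TRANSACTIONS)
print(count, min_attr, ",".join(scanned))  # 5 A2 Z0,Z2,Z4,Z6,Z7,Z8
```

Only six records are read for this pair instead of all ten, because a record without A2 cannot contain the pair.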

As in the previous step, unification forms attribute triplets, and those that do not meet the minimum support requirement are eliminated (Table 4).

Table 4. Extracted triplets of attributes (3-elements).

Attribute (triplets)   Support   Min.   Id_Z
A1,A2,A3               4         A2     Z0,Z2,Z4,Z6,Z7,Z8
A1,A2,A5               2         A5     Z1,Z2,Z3,Z4,Z9
A1,A3,A5               2         A5     Z1,Z2,Z3,Z4,Z9
A2,A3,A5               2         A5     Z1,Z2,Z3,Z4,Z9

Association rules are generated on the basis of the minimum confidence after the support parameter has been verified for each attribute combination (single attributes, pairs and triplets). For this example, Table 5 shows the number of records of the starting data set scanned by the basic and the improved AA when examining the occurrence of single attributes, pairs and triplets.
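As an illustration of the rule generation step, the Python sketch below derives rules from the one frequent triplet of the example, (A1, A2, A3). The support counts are those obtained from Table 1; the minimum confidence of 0.7 and all names are assumptions chosen for the illustration.

```python
from itertools import combinations

# Support counts computed from Table 1 for (A1, A2, A3) and its subsets.
SUPPORT = {
    frozenset({"A1"}): 7,
    frozenset({"A2"}): 6,
    frozenset({"A3"}): 8,
    frozenset({"A1", "A2"}): 5,
    frozenset({"A1", "A3"}): 5,
    frozenset({"A2", "A3"}): 5,
    frozenset({"A1", "A2", "A3"}): 4,
}


def rules_from_itemset(itemset, support_of, min_confidence):
    """List (condition, consequence, confidence) for rules meeting the minimum confidence."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):
        for condition in combinations(items, size):
            conf = support_of[frozenset(items)] / support_of[frozenset(condition)]
            if conf >= min_confidence:
                consequence = tuple(a for a in items if a not in condition)
                rules.append((condition, consequence, conf))
    return rules


rules = rules_from_itemset({"A1", "A2", "A3"}, SUPPORT, min_confidence=0.7)
for condition, consequence, conf in rules:
    print(condition, "->", consequence, round(conf, 2))
# ('A1', 'A2') -> ('A3',) 0.8
# ('A1', 'A3') -> ('A2',) 0.8
# ('A2', 'A3') -> ('A1',) 0.8
```

Rules with a single attribute as the condition reach at most 4/6 here and are therefore dropped at this threshold.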

Table 5. Number of scanned records with basic and improved AA.

                           Basic Apriori   Improved Apriori
1 attribute                50              50
2 attributes (pairs)       60              34
3 attributes (triplets)    40              21
Total                      150             105
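The figures in Table 5 can be reproduced with the Python sketch below. It rests on two assumptions taken from the worked example: one "scan" is one record read for one candidate, and the candidates at each level are all combinations of the frequent single attributes. The names are hypothetical.

```python
from itertools import combinations

TRANSACTIONS = {
    "Z0": {"A1", "A2", "A3"}, "Z1": {"A1", "A4", "A5"},
    "Z2": {"A1", "A2", "A3", "A5"}, "Z3": {"A3", "A5"},
    "Z4": {"A1", "A2", "A3", "A4", "A5"}, "Z5": {"A1", "A3"},
    "Z6": {"A1", "A2", "A4"}, "Z7": {"A2", "A3"},
    "Z8": {"A1", "A2", "A3"}, "Z9": {"A3", "A5"},
}


def scan_counts(transactions, min_support, max_size=3):
    """Return rows (level, basic scans, improved scans) for levels 1..max_size."""
    index = {}
    for record_id, attrs in transactions.items():
        for attr in attrs:
            index.setdefault(attr, set()).add(record_id)
    n = len(transactions)
    # Level 1: both variants read every record once per attribute.
    rows = [(1, len(index) * n, len(index) * n)]
    frequent = sorted(a for a, ids in index.items() if len(ids) >= min_support)
    for k in range(2, max_size + 1):
        combos = list(combinations(frequent, k))
        basic = len(combos) * n  # every record, for every candidate
        # Improved: only the records of the attribute with the lowest support.
        improved = sum(min(len(index[a]) for a in combo) for combo in combos)
        rows.append((k, basic, improved))
    return rows


rows = scan_counts(TRANSACTIONS, min_support=4)
total_basic = sum(r[1] for r in rows)
total_improved = sum(r[2] for r in rows)
saving = 1 - total_improved / total_basic

print(rows)                         # [(1, 50, 50), (2, 60, 34), (3, 40, 21)]
print(total_basic, total_improved)  # 150 105
print(round(saving, 2))             # 0.3
```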

In the initial scan of single attributes, the number of scanned records is the same for both algorithms; at later levels the difference in the amount of scanned data is proportional to the number of attributes combined by unification (pairs, triplets, quadruplets, etc.). In total, the improved AA scans 30% fewer records than the basic AA in this example (105 versus 150).

5. Testing data

The analysis of rule generation speed used real, publicly available data on the number of ambulatory interventions performed in each medical field in hospitals in the United States [8]. The data, stored in an Access database, were exported to an Excel table that was then analysed in RapidMiner. The original data set contains 15 attributes and 4119 records. For the analysis the data were reduced and converted to a numeric type, and all missing values were removed [9]. The final starting data set contains 9 attributes and 3405 records. For the purpose of this paper, the number of records was doubled in the second set and tripled in the third tested set relative to the starting set (Table 6).

The RapidMiner process (Figure 1) consists of operators for data import, frequency discretization into two classes, transformation of numerical into binominal values, search for frequent item sets, and generation of association rules. The support parameter is set in the operator that searches for frequent item sets, and confidence in the operator that generates association rules.
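The preprocessing idea behind the discretization and binominal steps can be sketched in plain Python. This is not RapidMiner's implementation: it only illustrates an equal-frequency split of one numeric attribute into two classes, encoded as true/false values. The sample values and names are hypothetical and are not taken from the tested data set.

```python
def split_into_two_classes(values):
    """Equal-frequency split: the lower half maps to False, the upper half to True."""
    ordered = sorted(values)
    cut = ordered[len(ordered) // 2]  # first value of the upper half
    return [v >= cut for v in values]


# Hypothetical counts for one numeric attribute.
interventions = [12, 5, 30, 18, 7, 25]
flags = split_into_two_classes(interventions)
print(flags)  # [False, False, True, True, False, True]
```

Each true/false column obtained this way can then be treated as an attribute that is either present in or absent from a record, which is the form the frequent item set search expects. With many repeated values the two classes may not be exactly equal in size.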

The testing was conducted on a computer with an Intel Core i3 processor at 3.4 GHz and 3 GB of RAM, running the 32-bit Windows 10 operating system. RapidMiner 5 is open-source software based on the Java platform.

Figure 1. Analysis process with applied operators in RapidMiner.

6. Test results

Table 6 shows the test results for data sets with different numbers of records, with the corresponding minimum support and confidence parameters. The number of attributes and the minimum support are the same for every tested set.

The basic AA needs the most time for a full scan of the data set with the largest number of records and a confidence of 0.6, and the least time for the set with the fewest records and a confidence of 0.9.

Only 2.2 seconds were needed to scan the set with 3405 records at confidence 0.9. The difference in rule generation time between the medium and the largest set at confidence 0.9 is 1.3 seconds; with the improved Apriori, the second scan of the data set takes less time, the third even less, and so on. Assuming that, on the basis of the Min parameter, the set in the second scan is half the size, the rule generation time is reduced by approximately 20%.

Table 6. Test results for different data sets.

No. of

The improved AA generates rules faster because, after the initial scan of the entire set, in subsequent passes it uses the additional Min parameter and scans only a portion of the initial data set. The difference in the number of scanned records between the basic and the improved AA increases with the size of the attribute combinations whose frequency the algorithm examines. The time saved in rule generation is more pronounced for large data sets, and applying the improved AA eliminates the shortcomings of the basic AA.

In addition to the correct choice of DAP procedure and interpretation of the obtained rules, the speed of rule generation is becoming an important factor for making quality business decisions in today's e-business in the healthcare sector.

References

[1] P. Mandave, M. Mane, S. Patil, Data mining using Association rule based on Apriori algorithm and improved approach with illustration, International Journal of Latest Trends in Engineering and Technology (2013), Vol. 3, Issue 2, 107-113.

[2] P. S. Kumar, A. K. Panda, Use of Association rule mining in higher secondary education in Odisha, International Journal on Advanced Computer Theory and Engineering (2013), Vol. 2, Issue 6, 31-35.

[3] P. Agrawal, S. Kashyap, V. C. Pandey, S. P. Keshri, A review approach on various form of Apriori with Association rule mining, International Journal on Recent and Innovation Trends in Computing and Communication (2013), Vol. 1, Issue 5, 462-468.

[4] X. Fang, An improved Apriori algorithm on the frequent itemset, Conference on Education Technology and Information System, Atlantis Press, USA, 2013, 845-848.

[5] T. A. Kumbhare, S. V. Chobe, An overview of Association rule mining algorithms, International Journal of Computer Science and Information Technologies (2014), Vol. 5 (1), 927-930.

[6] S. Rao, P. Gupta, Implementing improved algorithm over Apriori data mining Association rule algorithm, International Journal of Computer Science And Technology (2012), Vol. 3, Issue 1, 489-493.

[7] M. Al-Maolegi, B. Arkok, An improved Apriori algorithm for Association rules, International Journal on Natural Language Computing (2014), Vol. 3, No. 1, 21-29.

[8] https://www.medicare.gov, accessed 7.10.2016.

[9] K. Pandole, N. Bhargava, Comparison and evaluation for grouping of null data in database based on K-means and Genetic algorithm, International Journal of Computer Technology and Electronics Engineering (2012), Vol. 2, Issue 3, 204-209.
