Hemodialysis Key Features Mining and Patients Clustering Technologies

Hindawi Publishing Corporation
Advances in Artificial Neural Systems
Volume 2012, Article ID 835903, 11 pages
doi:10.1155/2012/835903

Research Article

Tzu-Chuen Lu and Chun-Ya Tseng

Department of Information Management, Chaoyang University of Technology, Wufeng District, Taichung 41349, Taiwan

Correspondence should be addressed to Tzu-Chuen Lu, tclu@cyut.edu.tw

Received 3 March 2012; Revised 4 June 2012; Accepted 8 June 2012

Academic Editor: Anke Meyer-Baese

Copyright © 2012 T.-C. Lu and C.-Y. Tseng. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The kidneys are vital organs. Failing kidneys lose their ability to filter out waste products, resulting in kidney disease. To extend or save the lives of patients with impaired kidney function, kidney replacement therapy, such as hemodialysis, is typically utilized. This work uses an entropy function to identify key features related to hemodialysis. By identifying these key features, one can determine whether a patient requires hemodialysis. This work uses these key features as dimensions in cluster analysis. The proposed data mining scheme then finds the association rules of each cluster, so that hidden rules behind kidney disease can be identified. The contributions and key points of this paper are as follows. (1) This paper identifies key features that can be used to predict which patients have a high probability of requiring hemodialysis. (2) The proposed scheme applies the k-means clustering algorithm to the key features to categorize patients. (3) A data mining technique is used to find the association rules of each cluster. (4) The mined rules can be used to determine whether a patient requires hemodialysis.

1. Introduction

The human kidney is located on the posterior abdominal wall on both sides of the spinal column. The main functions of the kidney include metabolism control, waste and toxin excretion, regulation of blood pressure, and maintaining the body's fluid balance. All blood in the body passes through the kidney 20 times per hour. When renal function is impaired, the body's waste cannot be metabolized, which can result in back pain, edema, uremia, high blood pressure, inflammation of the urethra, lethargy, insomnia, tinnitus, hair loss, blurred vision, slow reaction time, depression, fear, mental disorders, and other adverse consequences. Furthermore, the kidney produces and secretes erythropoietin. When an impaired kidney secretes too little of it, red blood cell production is insufficient and patients develop anemia. The kidney also helps maintain the calcium and phosphate balance in blood, such that a patient with renal failure may develop bone lesions.

When renal function is abnormal, toxins can accumulate, damaging organs and possibly leading to death. To extend or save the lives of patients with impaired kidney function, kidney replacement is typically utilized, including kidney transplantation, hemodialysis (HD), and peritoneal dialysis (PD). Although kidney transplantation is the most clinically effective method, few donor kidneys are available and transplantation can be limited by the physical conditions of patients. Notably, HD can extend the lives of kidney patients.

Although medical technology is mature, factors causing diseases are changing due to changing environments. Any factor may potentially lead to disease. When the detection index of a patient exceeds the standard and kidney disease has been diagnosed, patients must go to the hospital for kidney replacement therapy. For instance, a doctor may recommend that high-risk patients adjust their habits by, say, stopping
smoking, controlling blood pressure, maintaining normal urination, controlling urinary protein levels, maintaining normal sleeping patterns, controlling blood sugar levels, reducing the use of medications, avoiding reductions in the body's resistance, maintaining low body fat levels, and reducing the burden on the kidneys.

However, improving one's physical condition and diet is insufficient. To monitor one's physical condition, periodic health examinations at a hospital have become a common disease-prevention strategy. Doctors may offer advice to patients based on health examination results to reduce disease risk.

Many scholars have applied data mining techniques for disease prediction. These techniques include clustering, association rules, and time-series analysis. Different analyses may require different mining techniques. Selection of an appropriate mining technique is the key to obtaining valuable data. However, choosing a data mining technique is very difficult for general hospitals, especially when dealing with different forms of original data. Therefore, to help medical professionals identify hidden factors that cause kidney diseases, this work applies a novel hemodialysis system (HD system). The HD system may identify factors not previously known.

General medical staff may perform routine examinations for particular factors associated with a particular disease and ignore other factors that may be associated with other diseases, such as kidney diseases. For example, staff may only assess blood urea nitrogen (BUN) and creatinine (CRE) levels and CRE clearance (CC). However, increasing amounts of data indicate that some hidden rules and relationships may exist. Therefore, this work uses an entropy function to identify key features related to HD. By identifying these key features, one can determine whether a patient requires HD. This work uses these key features as dimensions in cluster analysis. When patients requiring HD are classified into the same group, and the other patients are classified into the other group, the key features can effectively determine whether a patient requires HD. The proposed data mining scheme finds the association rules of each cluster. Hidden rules behind kidney disease can therefore be identified.

2. Literature Review

2.1. Hemodialysis

Hemodialysis is also called dialysis. An artificial kidney discharges uremic toxins and water to eliminate uremic symptoms. In an HD system, a semipermeable membrane separates the blood and dialysate. The blood passes continuously along one side of the artificial kidney and the dialysate carries away uremic toxins on the other side. Finally, the cleaned blood returns to the body. This continuous cycle eventually purifies the blood.

A doctor's recommendation to undergo dialysis depends on whether kidney failure is acute or chronic. If kidney failure is acute, the doctor will recommend that the patient undergo dialysis before uremic toxins accumulate. For chronic kidney failure, medical treatment is first utilized and HD may be initiated after uremia occurs. Additionally, a doctor may base the assessment on the causes of kidney failure, kidney size, anemic state, degradation of kidney function, and recovery. Moreover, each examination indicator will be assessed. The most commonly used indicators are BUN concentration, CRE concentration, CC, urine-specific gravity, and osmotic pressure [1, 2].

2.1.1. Blood Urea Nitrogen (BUN)

Blood urea nitrogen is the metabolite of proteins and amino acids excreted by the kidneys. The BUN concentration in blood can be used to determine whether kidney function is normal. The normal BUN range is 10–20 mg/dL. If the BUN concentration exceeds 20 mg/dL, this is called high azotemia. However, the BUN concentration may increase temporarily because of dehydration, eating large amounts of high-protein foods, upper gastrointestinal bleeding, severe liver disease, infection, steroid use, and impaired kidney blood flow. When the BUN concentration is high and the CRE concentration is normal, kidney function is normal. Although the BUN concentration can be used as an indicator of kidney function, it is not as accurate as the CRE concentration and CC.

2.1.2. Creatinine (CRE)

Creatinine is mainly a metabolite of muscle activity, and daily production is excreted through the kidneys. Daily CRE production cannot be fully excreted and the CRE concentration increases when kidney function is impaired. As the CRE concentration increases, kidney function decreases. Because CRE is a waste generated by muscle metabolism, the CRE concentration is associated with the total amount of muscle or weight but is not related to diet or water intake. The CRE concentration may reflect kidney function more accurately than the BUN concentration. However, a CRE concentration in the normal range does not necessarily mean that kidney function is normal; that is, CC is a better tool when assessing kidney function. The compensatory capacity of the kidney is large. For example, although the CRE concentration may increase only from 1.4 mg/dL to 1.5 mg/dL, kidney function may have declined by more than 50%.

2.1.3. Creatinine Clearance (CC)

Creatinine clearance is widely used and is an accurate estimation of kidney function. Creatinine clearance is the amount of CRE cleared per minute. The CC for a healthy person is 80–120 mL/min; the average is 100 mL/min. Kidney failure is minor when the CC is 50–70 mL/min and moderate when CC is only 30–50 mL/min. If CC is <30 mL/min, kidney failure is severe and uremic symptoms will develop gradually. When CC is <10 mL/min, a patient must start dialysis. By collecting all the urine produced within 24 hours, CC can be determined easily. Notably, CC is derived as follows:

$$\mathrm{CC} = \frac{\text{Urine CRE concentration (mg\%)} \times \text{24-hour urine volume (c.c.)}}{\text{Blood CRE concentration (mg\%)} \times 1440\ \text{minutes}}. \tag{1}$$

Table 1: Kidney function test features.

| Kidney function test items | Abbreviation | Reference | Units |
|---|---|---|---|
| Blood urea nitrogen | BUN | 5–25 | mg/dL |
| Creatinine | CRE | 0.3–1.4 | mg/dL |
| Uric acid | UA | 2.5–7.0 | mg/dL |
| Albumin-globulin ratio | A/G ratio | 1.0–1.8 | |
| Creatinine clearance/24 hrs urine | CC | M: 71–135; F: 78–116 | mL/min |
| Renin | Renin | 0.15–3.95 | pg/mL/hr |
| Creatinine urine | Creatinine urine | 60–250 | mg/dL |
| Natrium | Na | 135–145 | meq/L |
| Potassium | K | 3.4–4.5 | meq/L |
| Calcium | Ca | 8.4–10.6 | mg/dL |
| Phosphorus | IP | 2.1–4.7 | mg/dL |
| Alkaline phosphatase | ALP | 27–110 | U/L |

2.1.4. Urine-Specific Gravity and Osmotic Pressure

Urine-specific gravity and osmotic pressure reflect the ability of the kidney to concentrate urine. If the specific gravity of urine is ≤1.018 or each urine-specific gravity gap is ≤0.008, the ability of the kidney to concentrate urine is impaired. Moreover, the ratio of urine osmolality to blood osmotic pressure must exceed 1.0; otherwise, the ability of the kidney to concentrate urine is impaired. If the ratio of urine to blood osmotic pressure is ≤3 after water fasting for 12 hours, the ability of the kidney to concentrate urine is impaired. Abnormal urine concentration function usually occurs in patients with analgesic nephropathy.

Doctors recommend that patients undergo dialysis when their BUN concentration exceeds 90 mg/dL, the CRE concentration exceeds 9 mg/dL, and CC is <0.17 mL/sec, or the CRE concentration exceeds 707.2 mg/dL. However, when the BUN concentration begins increasing, the kidney is already very fragile. That is, more than 1/3 of the kidney has been damaged by the time HD is required [3].
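The creatinine clearance formula (1) and the CC bands quoted in Section 2.1.3 can be illustrated with a short sketch. The function names and the sample inputs are hypothetical, and the handling of band boundaries (for example, which band a CC of exactly 50 mL/min falls into, or values between 70 and 80 mL/min) is an assumption, since the text does not specify it.

```python
def creatinine_clearance(urine_cre_mg_pct, urine_volume_ml_24h, blood_cre_mg_pct):
    """Creatinine clearance in mL/min, following formula (1)."""
    if blood_cre_mg_pct <= 0:
        raise ValueError("blood creatinine must be positive")
    # 1440 is the number of minutes in the 24-hour collection period.
    return (urine_cre_mg_pct * urine_volume_ml_24h) / (blood_cre_mg_pct * 1440)


def kidney_failure_grade(cc_ml_min):
    """Map CC to the bands quoted in the text (boundary handling is assumed)."""
    if cc_ml_min < 10:
        return "dialysis indicated"
    if cc_ml_min < 30:
        return "severe"
    if cc_ml_min < 50:
        return "moderate"
    if cc_ml_min <= 70:
        return "minor"
    if 80 <= cc_ml_min <= 120:
        return "normal"
    # The text gives no label for 70-80 mL/min or for values above 120 mL/min.
    return "not classified in the text"


# Illustrative (made-up) inputs: 100 mg% urine CRE, 1440 c.c. of urine, 1.0 mg% blood CRE.
cc = creatinine_clearance(urine_cre_mg_pct=100.0,
                          urine_volume_ml_24h=1440.0,
                          blood_cre_mg_pct=1.0)
print(round(cc, 1), kidney_failure_grade(cc))
```

With these made-up inputs the clearance works out to 100 mL/min, which matches the healthy average quoted in Section 2.1.3.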
Thus, indexes such as the albumin globulin ratio (A/G ratio) of kidney function (Table 1), red blood cell (RBC) count in blood tests (Table 2), or white blood cell (WBC) count by urinalysis (Table 3) are related to kidney function [1]. This work proposes an effective scheme that identifies unknown key features to predict HD. This work uses the entropy function to identify key features that are strongly related to HD and applies the k-means clustering algorithm to these key features to group patients.

Table 2: Blood test features.

| Blood test items | Abbreviation | Reference | Units |
|---|---|---|---|
| Hemoglobin | Hb | M: 14–18; F: 12–16 | g/dL |
| Red blood cell | RBC | M: 450–600; F: 400–550 | mil/mm |
| White blood cell | WBC | 5000–10000 | mm |
| Hematocrit | Hct | M: 40–55; F: 37–50 | % |
| Platelets | PLT | 15–40.0 | 10 /uL |
| Mean corpuscular volume | MCV | 83–100 | u |
| Mean corpuscular hemoglobin | MCH | 27–32.5 | uug |
| Mean corpuscular hemoglobin concentration | MCHC | 32–36 | % |
| Reticulocyte | Reticulocyte | 0.5–2.0 | % |
| Malaria | Malaria | (−) | |
| Erythrocyte sedimentation rate | ESR | M: 1–15; F: 1–20 | mm/hr |
| Differential count | DC | | |
| Band | Band | 0–2 | % |
| Neutrophils | Neutrophils | 50–70 | % |
| Lymphocytes | Lymphocytes | 20–40 | % |
| Monocytes | Monocytes | 2–6 | % |
| Eosinophils | Eosinophils | 1–4 | % |
| Basophils | Basophils | 0–1 | % |
| Bleeding times | BT | 0–3 | Minute |
| Coagulation times | CT | 2–6 | Minute |
| Blood type | Blood type | | |
| Rhesus factor | Rh Factor | (+) | |
| Blood pressure | BP | | mm/Hg |
| Height | Height | | cm |
| Weight | Weight | | kg |

Table 3: Urine test features.

| Urine test items | Abbreviation | Reference | Units |
|---|---|---|---|
| Color/appearance | Color/appearance | | |
| Reaction pH | Reaction PH | 5.5–8.5 | |
| Protein | Protein | <(+) | mg/mL |
| Sugar | Sugar | (−) | g/dL |
| Bilirubin | BIL | (−) | |
| Urobilinogen | URO | ≤1; 4 | umol/L |
| Urine red blood cells | RBC | 0–3 | /HPF |
| Urine white blood cells | WBC | 0–5 | /HPF |
| Pus cell | Pus cell | 0–1 | /HPF |
| Epith cell | Epith cell | M: 0–3; F: 0–15 | /HPF |
| Casts | Casts | Not found | /LPF |
| Ketones | Ketones | (−) | mmol/L |
| Crystals | Crystals | −∼(±) | /LPF |
| Bacteria and other | Bacteria and other | − | /HPF |

Hung proposed association rule mining with multiple minimum supports for predicting hospitalization of HD patients [4]. Hung used this association rule to analyze factors that may lead to HD to reduce the number of patients hospitalized for kidney impairment. Hung relied on HD indexes routinely examined for patients each month, including BUN, CRE, uric acid (UA), natrium (Na), potassium (K), calcium (Ca), phosphate (IP), and alkaline phosphatase levels, and analyzed 667 derived variables, such as protein ratio, to determine whether monocytes were infected or a patient was undernourished. Hung obtained 9 rules from 5,793 records. For instance, diabetic patients with high cholesterol levels were hospitalized most. Inadequate dialysis was a high risk factor for hospitalization. If a patient was female, aged 40–49, infected with monocytes, and had a recent hemoglobin (Hb/Ht) test value that was too low, the frequency of hospitalization was high. If hematocrit (Ht) was abnormal twice in the last three months, average platelet volume (MPV) was abnormal twice, and total protein (TP) was abnormal once, the probability of hospitalization was 93%. If the TP, glutamic oxaloacetic transaminase (GOT), and glutamic pyruvic transaminase (GPT) of patients were abnormal twice in the last three months and uric acid was also abnormal, hospitalization risk was 100%.

Huang analyzed risk of mortality for patients on long-term HD in 2009 [5]. Huang used the Classification and Regression Tree, Mann-Whitney U Test, Chi-square Test, Pearson Correlation, and the Nomogram to analyze 992 patients on long-term HD. Albumin level and age were the factors most strongly related to mortality. Huang clustered and analyzed patients. If a patient had good nutrition and was young, mortality of diabetic patients was 5.45 times that of nondiabetic patients. However, if a patient was malnourished and older, albumin and CRE levels were the factors most strongly related to mortality. Thus, albumin level, age, diabetes status, and CRE level can help predict risk of mortality.

Yeh et al. used a data mining technique to predict hospitalization of HD patients in 2011 [6]. The availability of medical resources and dialysis quality may decline when too many patients are admitted to a hospital. Therefore, Yeh et al. used the C4.5 decision tree and the multiple minimum support (MS) association rule mining technology for analysis. The C4.5 decision tree was used to eliminate null values and association rule mining was used to identify hospitalization of HD patients. According to the records of hospitalized patients, hospitalized patients seldom have a chronic disease or may not have a chronic disease, but doctors only determine whether a patient should be hospitalized during an examination.

Lin used hospital records of patients combined with the association rule and time-series analysis to establish a health-management information system for chronic diseases [7]. Lin found that occluded cerebral arteries may lead to cerebral thrombosis and a cerebral embolism. After examination by a doctor, the rule is effective in avoiding a second stroke. Additionally, ill-defined heart diseases still require improvement. Lin used data mining to help the family members of chronic disease patients and medical staff control the disease.

These scholars usually used well-known blood tests as mining rules. This work uses an effective and novel scheme to identify some previously unknown features to predict HD. The entropy function is applied to identify features that are strongly related to HD, and the k-means clustering algorithm is applied with these key features to group patients.

2.2. Entropy Function

Information gain, proposed by Quinlan in 1979 [8], is a basis of the decision tree constructed by Iterative Dichotomiser 3 (ID3). Information gain can also be utilized to determine differences in feature attributes and other classification attributes. Further, it is usually used to select the split point of ID3.

We assume a classification problem that includes $N$ data records, $m$ feature dimensions, and $k$ clusters. The information gain of a single feature is determined from two correlated values, called entropies; the difference between these two values is the information gain:

$$\mathrm{Entropy}(N) = \sum_{t} P_t \times \log_2 \frac{1}{P_t} = -\sum_{t} P_t \times \log_2 P_t, \tag{2}$$

$$\mathrm{Entropy}(D_j) = \sum_{v=1}^{|D_j|} \frac{|D_{jv}|}{N} \times \mathrm{Entropy}(D_{jv}), \tag{3}$$

$$\mathrm{Gain}(D_j) = \mathrm{Entropy}(N) - \mathrm{Entropy}(D_j). \tag{4}$$

In (2), $\mathrm{Entropy}(N)$ is the total information content of the whole problem, and this total information content is taken as the basis of the information gain of a single feature; $P_t$ is the probability of occurrence of class $t$ in the $N$ records.

In (3), $\mathrm{Entropy}(D_{jv})$ is the information content of the classification among the records whose $j$th feature dimension takes its $v$th value, $D_{jv}$ is the set of those records, and the $j$th feature dimension has $|D_j|$ distinct values.

In (4), $\mathrm{Gain}(D_j)$ is the information gain that the $j$th feature dimension contributes to the classification problem. Through (2)–(4), the information gain of each feature for a classification problem is found. This work then evaluates all threshold settings and collects the features with the greatest information gain to form a feature set for classification. Entropy is used to identify key features, and HD patients are then clustered to determine the accuracy of the key features.

2.3. Clustering Algorithm

Although many clustering techniques have been proposed, the k-means algorithm is the most representative and widely applied [9]. The k-means algorithm is also called the generalized Lloyd algorithm (GLA) [10]. The k-means algorithm transforms each data record into a data point, and random numbers are utilized to generate the initial cluster centers. The distance between each data point and each cluster center is then calculated, and a data point is assigned to the cluster center to which it is closest. The newly recomputed cluster center is the average of all data points in a cluster, and the new cluster center is taken as the basis for the next iteration. This process is repeated until no change occurs. The steps of the k-means algorithm are as follows.

(1) Use random numbers to generate the initial cluster centers $C = \{1, 2, \ldots, k\}$.

(2) Calculate the Euclidean distance $d(X, C_i)$ for each data point $X = \{x_1, x_2, \ldots, x_m\}$ and each cluster center $C_i$. The point is classified into the $C_i$ with the shortest distance, and the distance formula is as follows:

$$d(X, C_i) = \sqrt{\sum_{j=1}^{m} \left(x_j - c_{ij}\right)^2}. \tag{5}$$

(3) Recompute the new cluster center $C_i$. If no data point changes cluster, clustering stops; otherwise, steps (2) and (3) are repeated.

2.4. Association Rule

An association rule is a widely used technique. It progressively scans a database to identify rules for the relationships between items. For instance, the rule that people will buy bread after buying milk is written milk → bread (support = 50% and confidence = 100%); support means that the probability of a consumer buying both milk and bread is 50%, and confidence means that the probability of a consumer buying bread after buying milk is 100%.

Agrawal et al. developed the Apriori algorithm in 1994 [11]. The Apriori algorithm is one of the most popular data mining methods, where $I$ is the set of all items, each data record is $X = \{x_1, x_2, \ldots, x_m\}$, and $X \subseteq I$. The expression of the association rule is $x_1 \rightarrow x_2$ (support, confidence), where $x_1 \subseteq I$, $x_2 \subseteq I$, and $x_1 \cap x_2 = \varphi$. Support and confidence affect mining results most. Support is the fraction of the $N$ data records in which both $x_1$ and $x_2$ occur, $(x_1 \cup x_2)/N$. Confidence is the probability that $x_2$ occurs given that $x_1$ occurs; a rule that meets both the minimum support and minimum confidence thresholds is called a strong association rule.

First, set the thresholds of minimum support and minimum confidence to generate frequently occurring items, where $L_b$ represents frequently occurring $b$-itemsets, and all generated $L_b$ frequent itemsets are combined to generate candidate itemsets. Only the itemsets and rules whose support and confidence values are greater than the minimum support and minimum confidence thresholds are retained. This process is repeated until all $L_b$ frequent itemsets are identified.

3. Proposed Algorithms

This work applies a novel and effective scheme to find key features that predict HD. This work uses the entropy function to find the key features that are strongly related to HD and applies the k-means clustering algorithm with these key features to group patients. Furthermore, the proposed scheme applies the data mining technique to identify association rules from each cluster. These rules can be used to warn patients who may require HD. Figure 1 shows the system architecture, which is divided into four procedures. These procedures are as follows.

(1) The input procedure, which should be handled very carefully, determines the disease target and loads data from various sources and formats into a database. This procedure has a marked impact on the subsequent procedures.

(2) The preprocess procedure is divided into two subprocedures. In the first, quantitative processing, data are converted into an appropriate analytical form; for example, a string form is converted into a numeric form, or a numeric form is converted into evenly spaced intervals. In the second, feature selection, this work uses the entropy function to find the key features that are strongly related to diseases.

(3) The mining procedure is also divided into two subprocedures. In the first, clustering analysis, the clustering algorithm is applied to these key features to group patients. In the second, association rule analysis, the Apriori algorithm is applied to find the association rules in each cluster.

(4) The output procedure presents the entire mining result; a medical professional then interprets the result and looks for any factor that may cause a disease.

Figure 1: The system architecture (input procedure, preprocess procedure, mining procedure, and output procedure).

3.1. Input Procedure

Examination information comes from many sources, such as a hospital information system (HIS), laboratory information system (LIS), or Excel report. These different systems may have different data storage formats. For example, in database A, gender is 1 for male and 2 for female, but in database B, M is for male and F is for female. Thus, an error may occur while collecting data. Therefore, one should apply preprocessing to ensure that information is correct, complete, and sufficient. The preprocessing is divided into five steps.

(1) Unified data storage format: to simplify mining, all information must be in the same format.

(2) Irrelevant data: if one does not specify the mining topic, mining efficiency and even accuracy will be adversely affected.

(3) Incorrect data: incorrect data may be caused by a source error or data entry error; thus, such data should be corrected or removed.

(4) Formats do not match: to smooth information mining, information must be converted into an appropriate format when necessary.

(5) Incomplete data: incomplete data is a common problem; for example, some information may be lost or lacking for a certain period.

3.2. Preprocess Procedure

Data are standardized to improve analytical accuracy. A standard value may be applied to an item such as triglycerides (TG). If the TG level is ≥201 mg/dL, it exceeds the standard and the standardized value is 100; if TG is normal, it is in the range of 20–200 mg/dL and the standardized value is 50; if TG is ≤19 mg/dL, it is lower than the standard and the standardized value is 0. If data are continuous, a packing normalization method is used; its formula is as follows:

$$v' = \frac{v - \min_j}{\max_j - \min_j} \times Q_j, \tag{6}$$

where $v$ represents raw data, $\min_j$ is the minimum value of $j$, $\max_j$ is the maximum value of $j$, $v'$ is the packing normalized value, and $Q_j$ is the quantified distance. Table 4 shows example data after quantization.

Table 4: Packing method normalized data.

| ID | Sex | Age | WBC | RBC | HB | BUN | CRE | UA | GOT | GPT | TP | ALB | GLO | A/G | TG | Dialysis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Q | 2 | 5 | 4 | 3 | 3 | 4 | 2 | 2 | 4 | 5 | 2 | 2 | 2 | 2 | 3 | 2 |
| 1 | 1 | 3 | 1 | 1 | 1 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 1 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| 3 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| 5 | 1 | 3 | 1 | 1 | 1 | 2 | 1 | 0 | 3 | 4 | 1 | 1 | 1 | 0 | 1 | 0 |
| 6 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 7 | 0 | 4 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 8 | 1 | 1 | 1 | 1 | 2 | 3 | 1 | 0 | 2 | 4 | 0 | 0 | 0 | 1 | 2 | 1 |
| 9 | 1 | 2 | 0 | 1 | 2 | 2 | 1 | 1 | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 0 |
| 10 | 0 | 2 | 3 | 1 | 0 | 2 | 0 | 1 | 2 | 1 | 0 | 0 | 1 | 0 | 2 | 0 |
| 11 | 1 | 0 | 1 | 2 | 2 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| 12 | 0 | 0 | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 13 | 0 | 2 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| 14 | 1 | 2 | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 15 | 1 | 2 | 3 | 1 | 2 | 3 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |

Table 4 is a normalized form used to derive information gain and in association rule analysis, and it can effectively differentiate between patients. This work simultaneously uses extreme value normalization; its formula is

$$v' = \frac{v - \min_j}{\max_j - \min_j} \times 100, \tag{7}$$

where $v$ represents raw data, $\min_j$ is the minimum value of $j$, $\max_j$ is the maximum value of $j$, and $v'$ is the extreme value normalized value. For instance, if the WBC value is 6.3, $\max_j = 10.7$, and $\min_j = 3.5$, then $v' = [(6.3 - 3.5)/(10.7 - 3.5)] \times 100 = 38.89\%$ can be derived by applying (7).

In the entire database, the maximum and minimum values of each item markedly affect the quantification result, and extreme values are called outliers. If outliers exist, anomalies will also exist; for example, suppose that $Q$ of CRE is 80, and CRE values are generally 0.37–2.99; however, a polarized datum may occur when a record is 6990. After quantization, values in the range of 0.37–2.99 will be quantified as 1, and the value recorded as 6990 will be assigned 80. Therefore, this work creates a mechanism to remove outliers. To avoid the influence of outlier values, this work sets a minNum threshold for each record. For example, assume minNum = 3 is the threshold. The total number of hemoglobin (HB) records quantified as 2 (HB = 2) is 9; however, the number of HB records quantified as 0 (HB = 0) is 1. This means that most data are assigned to HB = 2, and only 1 datum is assigned to HB = 0. A quantified value whose total count is smaller than minNum is treated as an extreme value. This scheme replaces the extreme value with the average value.

3.3. Information Gain Analysis

This work uses the dialysis item as the class when computing information gain. For example, 6 patients are on dialysis (Dialysis = 1) (Table 4), so the occurrence probability is $P_1 = 6/15$ and the corresponding entropy term is $P_1 \times \log_2(1/P_1) = (6/15) \times \log_2(15/6) = 0.528771$. Nine patients are not on dialysis (Dialysis = 0), so the occurrence probability is $P_0 = 9/15$ and the entropy term is $P_0 \times \log_2(1/P_0) = (9/15) \times \log_2(15/9) = 0.442179$. The total of the terms for $P_0$ and $P_1$, that is, $\mathrm{Entropy}(N)$, is 0.970951.

Next, this work calculates the information gain of each item relative to the dialysis item. Take Sex (Table 5) as an example. Seven records are women (Sex = 0), of which 4 are non-dialysis (Dialysis = 0); the probability of Dialysis = 0 given Sex = 0 is $P_{jv} = 4/7$, and the entropy term is 0.46. Three records have Sex = 0 and Dialysis = 1; thus, the probability is $P_{jv} = 3/7$, and the entropy term is 0.52. The total of these two terms is 0.99 (before rounding, about 0.9852). The weighted entropy of the women is $0.9852 \times (7/15) = 0.459773$ because the probability of Sex = 0 is 7/15. After summing the weighted entropy of the women (Sex = 0) and men (Sex = 1), the total is 0.968804, where 0.968804 = 0.459773 + 0.509031. Next, via (4), which is $\mathrm{Entropy}(N) - \mathrm{Entropy}(D_j)$, $\mathrm{Gain}(D_j) = 0.970951 - 0.968804 = 0.002147$.

Table 5: Calculation of the information gain of Sex relative to Dialysis.

| Sex | Dialysis | Count ($D_{jv}$) | $P_{jv}$ | $P_{jv} \times \log_2(1/P_{jv})$ | Entropy ($D_{jv}$) | Weighted entropy |
|---|---|---|---|---|---|---|
| 0 | 0 | 4 | 4/7 | 0.46 | 0.99 | 0.459773 |
| 0 | 1 | 3 | 3/7 | 0.52 | | |
| 1 | 0 | 5 | 5/8 | 0.42 | 0.95 | 0.509031 |
| 1 | 1 | 3 | 3/8 | 0.53 | | |
| Sum | | | | | | 0.968804 |

The information gain of each item related to dialysis can be obtained and ranked, and the association rules can be mined using the top few items as key features. Take Table 6 as an example. Assume that the top three items are chosen. Thus, Age, WBC, and BUN are taken as key features.

Table 6: Information gain of each item.

| Items | Sex | Age | WBC | RBC | HB | BUN | CRE | UA | GOT | GPT | TP | ALB | GLO | A/G | TG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gain | 0.002 | 0.577 | 0.329 | 0.14 | 0.06 | 0.28 | 0.05 | 0.09 | 0.18 | 0.2 | 0.05 | 0.24 | 0.02 | 0.06 | 0.03 |

3.4. Data Mining Procedure

3.4.1. Missing Values

Some patients may have missing values. If their records are removed directly, some important information may be lost. Thus, this work applies a second filter before data mining analysis. This research sets minMissing as the threshold and takes missingNum as the number of null values in each record. If missingNum > minMissing, then the record is removed. Otherwise, when missingNum ≤ minMissing, the record is retained and the missing null values are replaced by the mean value. For instance, suppose Age, WBC, and BUN are the top three key features and some records lack values for them. Assume minMissing is 1. A record for which missingNum > 1 is removed; otherwise, the record is retained and the missing null values are replaced by the mean value.

3.4.2. Clustering

This work uses key features for clustering, where $x_1, x_2, \ldots, x_m$ are the $m$ key features, $X = \{x_1, x_2, \ldots, x_m\}$ is a patient record, $x_j$ is a key feature in $X$, $1 \le j \le m$, and $k$ is the cluster number. The k-means process is as follows.

(1) First, randomly generate $k$ initial cluster centers $C = \{c_1, c_2, \ldots, c_k\}$. Figure 2(a) has ten solid circles, $N = 10$, which are the locations of each record, and three triangles, $k = 3$, which are the locations of the cluster centers $C$.

(2) Apply (5), $d(X, C_i) = \sqrt{\sum_{j=1}^{m} (x_j - c_{ij})^2}$, to calculate the distance between each patient's data point $X$ and the cluster center $C_i$. When the distance $d_1$ from $X$ to $C_1$ is less than the distance $d_i$ to every other center, $X$ is classified to $C_1$.

(3) Let $C_i = \{X_{c_i 1}, X_{c_i 2}, \ldots, X_{c_i S}\}$ be the membership of a cluster center, where $S$ is the total number of members in $C_i$, and $X_{c_i u}$ is the $u$th patient's data point in $C_i$. Thus, the data points in each $C_i$ are summed and averaged, and

$$\mathrm{new}C_i = \left\{ \frac{\sum_{u=1}^{S} x_{u1}}{S}, \frac{\sum_{u=1}^{S} x_{u2}}{S}, \ldots, \frac{\sum_{u=1}^{S} x_{um}}{S} \right\}$$

can then be obtained. This value is taken as the new cluster center.

(4) Repeat steps (2) and (3) until each $C_i$ remains the same.

Figure 2: The diagram of the clustering algorithm. (a) Initial dataset and cluster centers (before). (b) Center displacement (after).

3.4.3. Association Rule

Next, the proposed scheme finds the characteristic rules of each cluster using Apriori association rule analysis. We assume that the total number of records in cluster $C_i$ is $S$, and each cluster membership is $C_i = \{X_{c_i 1}, X_{c_i 2}, \ldots, X_{c_i S}\}$; thus, the $u$th patient's data point $X_{c_i u}$ is in $C_i$, and the key features $x_{uj}$ are in $X_{c_i u} = \{x_{u1}, x_{u2}, \ldots, x_{um}\}$. Next, the association rule is used to analyze each cluster $C_i$.

(1) First, set the values of minimum support minSup and minimum confidence minConf.

(2) Convert the normalization table into an extreme values table.

(3) Find the candidate set. We assume $\alpha = \mathrm{item}_{jp}$, where $\mathrm{item}_{jp}$ is the $p$th quantified value of the $j$th key feature in $C_i$, $1 \le p \le Q_j$, and $\mathrm{Sup}(\alpha)$ denotes the occurrence probability of $\mathrm{item}_{jp}$ in $C_i$. If $\mathrm{Sup}(\alpha) \ge \mathrm{minSup}$, then $\mathrm{item}_{jp}$ becomes a member of candidate itemset $L_z$; proceed to the next step.

(4) Through candidate set $L_z = \{\alpha_1, \alpha_2, \ldots, \alpha_y\}$, generate a set of two-item combinations, $L_{z+1} = \{\alpha_1 \cup \alpha_2, \alpha_1 \cup \alpha_3, \ldots, \alpha_1 \cup \alpha_y, \alpha_2 \cup \alpha_3, \ldots, \alpha_{y-1} \cup \alpha_y\}$; however, $\alpha_A$ and $\alpha_B$ cannot be the same item. Calculate the occurrence probability of each group, $\mathrm{Sup}(\alpha_A \cup \alpha_B)$. If $\mathrm{Sup}(\alpha_A \cup \alpha_B) > \mathrm{minSup}$, it becomes a member of frequent itemset $L_{z+1}$.

(5) Take $L_{z+1}$ as a candidate set and repeat step (4) until the candidate set is null.

(6) Generate the association rules of the frequent itemsets. If the confidence of a rule exceeds minConf, the rule is established. The process is as follows.

(i) Let $\alpha^*$ be one of the frequent itemsets $L^*$, $\alpha^* = (R_A \cup R_B)$.

(ii) Generate rules $R_A \rightarrow R_B$ and $R_B \rightarrow R_A$.

Consider a cluster $C_A$, where minSup = 2 and minConf = 0.5, the key features are $\mathrm{item}_1$ = Age, $\mathrm{item}_2$ = WBC, and $\mathrm{item}_3$ = BUN, and $S = 7$ is the total number of records in $C_A$. This work finds the frequent itemsets $L_1$ using the minSup and minConf thresholds. The proposed scheme merges two items from $L_1$ as a candidate set, where $j$ = Age and $p = 3$ in $\alpha_1$, and $j$ = WBC and $p = 1$ in $\alpha_3$, and then calculates $\mathrm{Sup}(\alpha_1 \cup \alpha_3)$. If $\mathrm{Sup}(\alpha_1 \cup \alpha_3) \ge \mathrm{minSup}$, then $\alpha_1 \cup \alpha_3$ becomes a frequent itemset; this continues until no more frequent itemsets are found.

Next, once all rules are found, the quantified values are converted back into their original values; the formula is

$$v' = \frac{v}{Q_j} \times \left(\max_j - \min_j\right) + \min_j, \tag{8}$$

where $v$ is a quantified value, $\min_j$ is the minimum value of $j$, $\max_j$ is the maximum value of $j$, $v'$ is the original value, and $Q_j$ is a quantified interval. Take WBC = 1 → Age = 3 as an example rule. If the $\max_j$ of WBC is 10.7, the $\min_j$ value is 3.5, and $Q_j$ is 4, then the original value of WBC = 1 is WBC = 1/4 × (10.7 − 3.5) + 3.5 = 5.3. If the $\max_j$ value of Age is 68, the $\min_j$ value is 30, and $Q_j$ is 5, then the original value of Age = 3 is Age = 3/5 × (68 − 30) + 30 = 52.8. Through (8), the association rule WBC = 1 → Age = 3 can be transformed into WBC = 5.3 → Age = 52.8.

Table 7: Each interval of item.

| ID | Item | Interval |
|---|---|---|
| 1 | TG | 50 |
| 2 | AST (GOT) | 20 |
| 3 | Ch | 50 |
| 4 | ALT (GPT) | 20 |
| 5 | UA | 2 |
| 6 | K (Blood) | 2 |
| 7 | BUN | 5 |
| 8 | Amylase (B) | 50 |
| ... | ... | ... |

Table 8: Clustering results.

| Cluster | UA | AST (GOT) | TG | K (Blood) | Density |
|---|---|---|---|---|---|
| Cluster-1 | 6.54 | 24.48 | 119.72 | 5.10 | 14.14 |

4. Experimental Results

This experiment uses health examination records provided by hospitals. The data are mainly for outpatient dialysis and general outpatients. The hospital has 105 records with many values missing. This is because not every patient undergoes all examinations. Therefore, data must first be filtered to eliminate records with missing values. This work adopts BUN and CRE, which are related to kidney function, as the first filter.
If any null value occurs in BUN or CRE, the record Cluster-2 6.16 30.12 138.92 3.92 11.59 is removed. In total, 18,166 records are retained after the first filtering. Cluster-3 4.47 24.72 112.33 4.07 11.22 The purpose of quantification in the preprocess proce- Cluster-4 8.40 28.03 228.72 4.20 20.91 dure is to convert values into a continuity value or significant difference value from a finite interval. This work sets interval Q for each item based on recommendations by medical staff. Table 7 shows the intervals. collects the features with the greatest information gained to form a feature set for classification. Entropy is used to 4.1. Choose Key Features. The mining result does not make identify key features and cluster HD patients to determine the sense when too many items are used. The proposed scheme accuracy of key features. During the clustering process, the uses the Entropy function to identify the top 4 key features clustering algorithm is applied on these key features to group between each item and dialysis; these features are are UA, patients, and the entropy function can effectively determine AST (GOT), TG, and K (Blood). clustering analysis with the key features. Furthermore, this work applies the apriori algorithm to find the association 4.2. Mining Procedure rules of each cluster. Hidden rules for causing any kidney disease can therefore be identified. 4.2.1. Clustering Analysis. Based upon the above clustering This experiment adopts the health examination records algorithm, this work applies the k-means clustering algo- provided by one general hospital of Taiwan. During the rithm with these key features to group patients. Before the experiment process, the experimental results will be dis- experiment, records with many missing values were filtered cussed with medical staffs. From the experimental results, we out, leaving 7118 records. Table 8 shows the cluster grouping can find that if BUN is in the range of 58.5–61.5 (60± 1.5) result. 
For example, 1169 patients are classified into the first and Na (Blood) is in the range of 137.5–140.25 (140± 2.5), group. The average indicator values are UA = 6.54, AST patients have a high risk of receiving a dialysis. The BUN (GOT) = 24.48, TG = 119.79, and K (Blood) = 5.10, and is reported to be a reliable indicator of high risk, but the the average density of the first group is 13.26. The average Na (Blood) is not clearly defined. Therefore, the Na (Blood) difference among all groups is 27.02, which is the best result needs for further analysis and clarification. Conversely, if UA of 100 random trial runs. is in the range of 6.25–6.75 (6.5± 0.25), TG is in the range of 134.75–184.75 (159.75±25),and K(Blood)isinthe range 4.2.2. Association Rule Analysis. This work identifies the top of 3.89–4.39 (4.14± 0.25), or AC-GLU is in the range of four itemsrelated to dialysisasTG, AST(GOT),UA, andK 111–161 (136± 25), patients have a low risk of receiving a (Blood); AST (GOT) is the main indicator of liver function. dialysis. These four items are adopted as key features and the The medical staffs express that the UA, TG, and AC-GLU association rule technique is applied to analyze each group will definitely affect the possibility of patients to receive a rule after clustering, where minSup = 35% and minConf = dialysis, but K (Blood) is not clearly defined to create an 65%. The association rules of the four clusters are shown in influence on patients. The factor should be further analysis. Table 9. At last, there is one more special feature, AST (GOT) because 4.3. Summary. This work uses the clustering algorithm and it appears both in the groups of high risk and low risk. The the association rule algorithm to identify some previously medical staffs express, actually AST (GOT) is not directly unknown features of HD patients and possible association relatedtoHD. Thus,AST (GOT)isnot akey factor to rules. 
This work then evaluates all threshold settings and determine whether a patient requires HD. 10 Advances in Artificial Neural Systems Table 9: Association rule of each cluster-k. ∗ ∗ α Sup (α)Conf. Cluster-1 (k = 1) BUN = 60 ± 1.5 → Dialysis = Yes 487 91% Dialysis = Yes → AST (GOT) = 24.5 ± 10 708 74% AST (GOT) = 24.5 ± 10 → Dialysis = Yes 523 73% Na (Blood) =140 ± 2.5 → Dialysis = Yes 455 70% Dialysis = Yes→ BUN = 60 ± 1.5 487 69% Na (Blood) = 140 ± 2.5 → AST (GOT) = 24.5 ± 10 434 66% Cluster-2 (k = 2) CRE = 0.85 ± 0.15 → Dialysis = No 487 91% UA = 6.5 ± 0.25 TG = 159.75 ± 25 → Dialysis = No 1341 97% AC-GLU = 136 ± 25 → Dialysis = No 1265 94% TG = 159.75 ± 25 → Dialysis = No 1696 93% UA = 6.5 ± 0.25 → Dialysis = No 1920 93% AST (GOT) = 45 ±10 → Dialysis = No 1479 92% K (Boold) = 4.14 ± 0.25 → Dialysis = No 1938 91% TG = 159.75 ± 25 Dialysis = No→ UA = 6.5 ± 0.25 1341 79% TG = 159.75 ± 25 → UA = 6.5 ± 0.25 1378 76% TG =159.75 ± 25 → UA = 6.5 ± 0.25 Dialysis = No 1341 74% UA = 6.5 ± 0.25 Dialysis = No→ TG = 159.75 ± 25 1341 70% UA = 6.5 ± 0.25 → TG = 159.75 ± 25 1378 67% UA = 6.5 ± 0.25 → TG = 159.75 ± 25 Dialysis = No 1341 65% CRE = 0.85 ± 0.15 → Dialysis = No 487 91% UA = 6.5 ± 0.25 TG = 159.75 ± 25 → Dialysis = No 1341 97% Cluster-3 (k = 3) CRE = 0.85 ± 0.15 → Dialysis = No 732 100% CRE = 0.85 ± 0.15 K (Boold) = 5 ± 0.25 → Dialysis = No 560 100% K (Boold) = 4.14 ± 0.25 → Dialysis = No 910 95% AST (GOT) = 24.5 ± 10 K (Boold) = 4.14 ± 0.25 → Dialysis = No 507 94% AST (GOT) = 24.5 ± 10 → Dialysis = No 505 92% AST (GOT) = 24.5 ± 10 → Dialysis = No 679 86% CRE = 0.85 ± 0.15 → K (Boold) = 4.14 ± 0.25 560 77% CRE = 0.85 ± 0.15 Dialysis = No→ K (Boold) = 4.14 ± 0.25 560 77% CRE = 0.85 ± 0.15 → K (Boold) = 4.14 ± 0.25 Dialysis = No 560 77% AST (GOT) = 24.5 ± 10 Dialysis = No→ K (Boold) = 4.14 ± 0.25 507 75% Dialysis = No → K (Boold) = 4.14 ± 0.25 910 74% AST (GOT) = 24.5 ± 10 → K (Boold) = 4.14 ± 0.25 539 68% Cluster-4 (k = 4) AST (GOT) = 45 ± 10 K (Boold) = 4.14 ± 0.25 → 
Dialysis = No 364 98% K (Boold) = 4.14 ± 0.25 → Dialysis = No 503 91% AST (GOT) = 45 ± 10 → Dialysis = No 537 90% K (Boold) = 4.14 ± 0.25 Dialysis = No → AST (GOT) = 45 ± 10 364 72% Dialysis = No → AST (GOT) = 45 ± 10 537 71% AST (GOT) = 45 ± 10 Dialysis = No→ K (Boold) = 4.14 ± 0.25 364 68% K (Boold) = 4.14 ± 0.25 → AST (GOT) = 45 ± 10 372 68% Dialysis = No→ K (Boold) = 4.14 ± 0.25 503 67% K (Boold) = 4.14 ± 0.25 → AST (GOT) = 45 ± 10 Dialysis = No 364 66% Advances in Artificial Neural Systems 11 5. Conclusion Medical staffs try to find some information from patient’s health examination records to reduce the occurrence of disease. However, some hidden information may be ignored because of the human observation or the restriction of book. Although there are many data mining techniques that have been proposed, most of them are focused on some known items. Seldom techniques in regard with searching for hidden key features are proposed. The reason is because the examination items are too many but incomplete. It is hard to find out the association rule by using system. This research will help medical staffstofind some unknown key features to predict the hemodialysis. We apply k-means clustering algorithm with these key features to group the patients. Furthermore, the proposed scheme applies data mining technique to find the association rule from each cluster. The rules can help the patients to detect any occurrence possibility of disease. Acknowledgment The authors would like to thank the National Science Council of the Republic of China, Taiwan, for financially supporting this paper under Contract no. NSC 99-2622-E-324-006- CC3. References [1] DrKao, “Normal Test Values,” 2010, http://www.drkao.com/ 1st site/health wap/normal main.htm. [2] Green Cross, “How to Detect Renal Function,” 2010, http:// www.greencross.org.tw/kidney/symptom sign/kid func.html. [3] Shin Kong Wu Ho-Su Memorial Hospital, 2010, http://www .skh.org.tw/mnews/178/4-2.htm. [4] K. C. 
Hung, Multiple minimum support association rule mining for hospitalization prediction of hemodialysis patients [M.S. thesis], Computer Science and Information Engineering, 2004. [5] S. Y. Huang, The evaluation & analysis of the risk of mortality for patients receiving long-term hemodialysis proposal [M.S. thesis], Graduate Institute of Biomedical Informatics, 2009. [6] J. Y. Yeh, T. H. Wu, and C. W. Tsao, “Using data mining tech- niques to predict hospitalization of hemodialysis patients,” Decision Support Systems, vol. 50, no. 2, pp. 439–448, 2011. [7] Y. J. Lin, Applying data mining in health management infor- mation system for chronic desease [M.S. thesis],Departmentof Computer Science and Information Management, 2008. [8] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, no. 1, pp. 81–106, 1986. [9] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, andA.Y.Wu, “Anefficient k-means clustering algorithms: analysis and implementation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881–892, 2002. [10] J. Z. C. Lai, T. J. Huang, and Y. C. Liaw, “A fast k-means clus- tering algorithm using cluster center displacement,” Pattern Recognition, vol. 42, no. 11, pp. 2551–2556, 2009. [11] R. Agrawal, R. Srikant, H. Mannila et al., “Fast discovery of association rules,” in Advances in Knowledge Discovery and Data Mining, pp. 307–328, 1996. 
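Supplementary sketch. The information-gain arithmetic of the Table 5 example (Sex relative to Dialysis) can be reproduced with a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `entropy` and `information_gain` and the dictionary layout are hypothetical, and the base-2 logarithm is an assumption, chosen because it reproduces the values reported in Table 5. The weights use all 15 records of the example table.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2, assumed) of a list of class counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def information_gain(groups):
    """Gain(D_j) = Entropy(N) - sum_v (|D_jv| / N) * Entropy(D_jv).

    `groups` maps each value v of feature j to its per-class record counts.
    """
    n = sum(sum(c) for c in groups.values())
    # Class totals over the whole data set, used for Entropy(N).
    class_totals = [sum(col) for col in zip(*groups.values())]
    weighted = sum(sum(c) / n * entropy(c) for c in groups.values())
    return entropy(class_totals) - weighted


# Counts taken from Table 5: Sex value -> [non-dialysis, dialysis].
sex_groups = {0: [4, 3], 1: [5, 3]}
gain = information_gain(sex_groups)
print(round(gain, 6))
```

Running the sketch should give a gain close to the 0.002147 reported in the text, which is why Sex ranks last in Table 6. Ranking the other items would only require calling the hypothetical `information_gain` once per column with that column's counts.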

Hemodialysis Key Features Mining and Patients Clustering Technologies



Publisher
Hindawi Publishing Corporation
Copyright
Copyright © 2012 Tzu-Chuen Lu and Chun-Ya Tseng. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ISSN
1687-7594
DOI
10.1155/2012/835903

Abstract

Hindawi Publishing Corporation Advances in Artificial Neural Systems Volume 2012, Article ID 835903, 11 pages doi:10.1155/2012/835903 Research Article Hemodialysis Key Features Mining and Patients Clustering Technologies Tzu-Chuen Lu and Chun-Ya Tseng Department of Information Management, Chaoyang University of Technology, Wufeng District, Taichung 41349, Taiwan Correspondence should be addressed to Tzu-Chuen Lu, tclu@cyut.edu.tw Received 3 March 2012; Revised 4 June 2012; Accepted 8 June 2012 Academic Editor: Anke Meyer-Baese Copyright © 2012 T.-C. Lu and C.-Y. Tseng. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The kidneys are very vital organs. Failing kidneys lose their ability to filter out waste products, resulting in kidney disease. To extend or save the lives of patients with impaired kidney function, kidney replacement is typically utilized, such as hemodialysis. This work uses an entropy function to identify key features related to hemodialysis. By identifying these key features, one can determine whether a patient requires hemodialysis. This work uses these key features as dimensions in cluster analysis. The key features can effectively determine whether a patient requires hemodialysis. The proposed data mining scheme finds association rules of each cluster. Hidden rules for causing any kidney disease can therefore be identified. The contributions and key points of this paper are as follows. (1) This paper finds some key features that can be used to predict the patient who may has high probability to perform hemodialysis. (2) The proposed scheme applies k-means clustering algorithm with the key features to category the patients. (3) A data mining technique is used to find the association rules from each cluster. 
(4) The mined rules can be used to determine whether a patient requires hemodialysis. 1. Introduction When renal function is abnormal, toxins can be pro- duced, damaging organs and possibly leading to death. To The human kidney is located on the posterior abdominal extend or save the lives of patients with impaired kidney wall on both sides of the spinal column. The main functions function, kidney replacement is typically utilized, including of the kidney include metabolism control, waste and toxin kidney transplantation, hemodialysis (HD), and peritoneal excretion, regulation of blood pressure, and maintaining the dialysis (PD). Although kidney transplantation is the most body’s fluid balance. All blood in the body passes through the clinically effective method, few donor kidneys are available kidney 20 times per hour. When renal function is impaired, and transplantation can be limited by the physical conditions the body’s waste cannot be metabolized, which can result in of patients. Notably, HD can extend the lives of kidney back pain, edema, uremia, high blood pressure, inflamma- patients. tion of the urethra, lethargy, insomnia, tinnitus, hair loss, Although medical technology is mature, factors causing blurred vision, slow reaction time, depression, fear, mental diseases are changing due to changing environments. Any disorders, and other adverse consequences. Furthermore, an factor may potentially lead to disease. When the detection impaired kidney will produce and secrete erythropoietin. index of a patient exceeds the standard and kidney disease When secretion of red blood cells is insufficient, patients will has been diagnosed, patients must go the hospital for kidney have the anemia. The kidney also helps maintain the calcium replacement therapy. For instance, a doctor may recommend and phosphate balance in blood, such that a patient with that high-risk patients adjust their habits by, say, stopping renal failure may develop bone lesions. 
smoking, controlling blood pressure, maintaining normal 2 Advances in Artificial Neural Systems urination, controlling urinary protein levels, maintaining toxins accumulate. For chronic kidney failure, medical normal sleeping patterns, controlling blood sugar levels, treatment is first utilized and HD may be initiated after reducing the use of medications, avoiding reductions in the uremia occurs. Additionally, a doctor may assess according to body’s resistance, maintaining low body fat levels, and reduc- the causes of kidney failure, kidney size, anemic state, degra- ing the burden on the kidneys. dation of kidney function, and recovery. Moreover, each However, improving one’s physical condition and diet examination indicator will be assessed. The most commonly are insufficient. To control one’s physical condition, periodic used indicators are BUN concentration, CRE concentration, health examinations at a hospital have become a common CC, urine-specific gravity, and osmotic pressure [1, 2]. disease-prevention strategy. Doctors may offer advice to patients based on health examination results to reduce dis- 2.1.1. Blood Urea Nitrogen (BUN). Blood urea nitrogen is ease risk. the metabolite of proteins and amino acids excreted by the Many scholars have applied data mining techniques kidneys. The BUN concentration in blood can be used to for disease prediction. These techniques include clustering, determine whether kidney function is normal. The normal association rules, and time-series analysis. Different analyses BUN range is 10–20 mg/dL. If the BUN concentration may require different mining techniques. Selection of an exceeds20mg/dL,thisiscalledhighazotemia. However, appropriate mining technique is the key to obtaining valu- the BUN concentration may increase temporarily because able data. 
However, choosing a data mining technique is very of dehydration, eating large amounts of high-protein foods, difficult for general hospitals, especially when dealing with upper gastrointestinal bleeding, severe liver disease, infec- different forms of original data. Therefore, to help medical tion, steroid use, and impaired kidney blood flow. When professionals identify hidden factors that cause kidney the BUN concentration is high and the CRE concentration diseases, this work applies a novel hemodialysis system (HD is normal, kidney function is normal. Although the BUN system). The HD system may identify factors not previously concentration can be used as an indicator of kidney function, known. it is not as accurate as the CRE concentration and CC. General medical staff may perform routine examinations for particular factors associated with a particular disease 2.1.2. Creatinine (CRE). Creatinine is mainly a metabolite and ignore other factors that may be associated with other diseases, such as kidney diseases. For example, staff may of muscle activity and daily production is excreted through only assess blood urea nitrogen (BUN) and creatinine (CRE) the kidneys. Daily CRE production cannot be fully excreted levels and CRE clearance (CC). However, increasing amounts and the CRE concentration increases when TRY kidney of data indicate that some hidden rules and relationships function is impaired. As the CRE concentration increases, may exist. Therefore, this work uses an entropy function kidney function decreases. Because CRE is a waste generated to identify key features related to HD. By identifying these by muscle metabolism, the CRE concentration is associated key features, one can determine whether a patient requires with the total amount of muscle or weight but is not related HD. This work uses these key features as dimensions in to diet or water intake. The CRE concentration may reflect cluster analysis. 
When patients requiring HD are classified kidney function more accurately than the BUN concentra- into the same group, and the other patients are classified into tion. When the CRE concentration is in the normal range, the other group, the key features can effectively determine it does mean that kidney function is normal; that is, CC is whether a patient requires HD. The proposed data mining a better tool when assessing kidney function. The compen- scheme finds association rules of each cluster. Hidden rules satory capacity of the kidney is large. For example, although for causing any kidney disease can therefore be identified. the CRE concentration may increase from 1.4 mg/dL to 1.5 mg/dL, kidney function may have declined by more than 2. Literature Review 50%. 2.1. Hemodialysis. Hemodialysis is also called dialysis. An 2.1.3. Creatinine Clearance (CC). Creatinine clearance is artificial kidney discharges uremic toxins and water to widely used and is an accurate estimation of kidney function. eliminate uremic symptoms. In an HD system, a semi- permeable membrane separates the blood and dialysate. The Creatinine Clearance is the amount of CRE cleared per minute. The CC for a healthy person is 80–120 mL/min; the human blood continues passing through on one side of an artificial kidney and the dialysate carries away uremic toxins average is 100 mL/min. Kidney failure is minor when the on the other side. Finally, the cleaned blood will back into the CC is 50–70 mL/min and moderate when CC is only 30– body.Thiscontinuouscycle eventually purifiesblood. 50 mL/min. If CC is <30 mL/min, kidney failure is severe A doctor may recommend that patient undergo dialysis and uremic symptoms will develop gradually. When CC is according to the difference between acute and chronic. If <10 gradually, a patient must start dialysis. 
By collecting all kidney failure is acute, the doctor will recommend that the the urine produced within 24 hours, CC can be determined patient undergo dialysis before the occurrence of uremic easily. Notably, CC is derived as follows: Urine CRE mg% × 24 hours urine volume (c.c.) concentration CC = . (1) ( ) Blood CRE mg% × 1440 minutes concentration Advances in Artificial Neural Systems 3 Table 1: Kidney function test features. 2.1.4. Urine-Specific Gravity and Osmotic Pressure. Urine- specific gravity and osmotic pressure reflects the ability of Kidney function test items Reference Units the kidney to concentrate urine. If the specific gravity of Blood urea nitrogen BUN 5–25 mg/dL urine is ≤1.018 or each urine-specific gravity gap is ≤0.008, Creatinine CRE 0.3–1.4 mg/dL the ability of the kidney to concentrate urine is impaired. Uric acid UA 2.5–7.0 mg/dL Moreover, the ratio of osmolality to blood osmotic pressure Albumin-globulin in must exceed 1.0; otherwise, the ability of the kidney to A/G ratio 1.0–1.8 ratio concentrate urine is impaired. If the ratio of urine to blood M: 71–135 Creatinine osmotic pressure is ≤3 after water fasting for 12 hours, CC mL/min clearance/24 hrs urine F: 78–116 the ability of the kidney to concentrate urine is impaired. Renin Penin 0.15–3.95 pg/mL/hr Abnormal urine concentration function usually occurs in patients with analgesic nephropathy. Creatinine urine Creatinine urine 60–250 mg/dL Doctors recommend patients undergo dialysis when their Natrium Na 135–145 meq/L BUN concentration exceeds 90 mg/dL, the CRE concentra- Potassium K 3.4–4.5 meq/L tion exceeds9mg/dL, andCCis <0.17 mL/sec, or the CRE Calcium Ca 8.4–10.6 mg/dL concentration exceeds 707.2 mg/dL. However, when the BUN IP 2.1–4.7 mg/dL Phosphorus concentration begins increasing, the kidney is very fragile. ALP 27–110 U/L Alkaline phosphatase That is, the kidney that has been damaged exceeds 1/3 when HD is required [3]. 
Thus, indexes such as the albumin Table 2: Blood test features. globulin ratio (A/G ratio) of kidney function (Table 1), red blood cell (RBC) count in blood tests (Table 2), or white Blood test items Reference Units M: 14–18 blood cell (WBC) count by urinalysis (Table 3) are related to Hemoglobin Hb g/dL kidney function [1]. This work proposes an effective scheme F: 12–16 that identifies unknown key features to predict HD. This M: 450–600 Red blood cell RBC mil/mm work uses the entropy function to identify key features that F: 400–550 are strongly related to HD and applies the k-means clustering White blood cell WBC 5000–10000 mm algorithm to these key features to group patients. M: 40–55 Hematocrit Hct % Hung proposed an association rule mining with multiple F: 37–50 minimum supports for predicting hospitalization of HD Platelets PLT 15–40.0 10 /uL patients [4]. Hung used this association rule to analyze Mean corpuscular MCV 83–100 u factors that may lead to HD to reduce the number of patients volume hospitalized for kidney impairment. Mean corpuscular MCH 27–32.5 uug hemoglobin Hung relied on routinely examined HD indexes for patients per month, including BUN, CRE, uric acid (UA), Mean corpuscular hemoglobin MCHC 32–36 % natrium (Na), potassium (K), calcium (Ca), phosphate (IP), concentration and alkaline phosphatase levels and analyzed 667 derived Reticulocyte Reticulocyte 0.5–2.0 % variables, such as protein ratio, to determine whether mono- Malaria (−) Malaria cytes infected or a patient was undernourished. Hung Erythrocyte M: 1–15 obtained 9 rules from 5,793 records. For instance, diabetic ESR mm/hr sedimentation Rate. F: 1–20 patients with high cholesterol levels were hospitalized most. Differential count DC Inadequate dialysis was a high risk factor for hospitalization. 
Band | Band | 0–2 | %
Neutrophils | Neutrophils | 50–70 | %
Lymphocytes | Lymphocytes | 20–40 | %
Monocytes | Monocytes | 2–6 | %
Eosinophils | Eosinophils | 1–4 | %
Basophils | Basophils | 0–1 | %
Bleeding times | BT | 0–3 | Minute
Coagulation times | CT | 2–6 | Minute
Blood type | Blood type | |
Rhesus factor | Rh Factor | (+) |
Blood pressure | BP | | mm/Hg
Height | Height | | cm
Weight | Weight | | kg

Table 3: Urine test features.
Urine test items | Abbreviation | Reference | Units
Color/appearance | Color/appearance | |
Reaction pH | Reaction PH | 5.5–8.5 |
Protein | Protein | <(+) | mg/mL
Sugar | Sugar | (−) | g/dL
Bilirubin | BIL | (−) |
Urobilinogen | URO | ≤1; 4 | umol/L
Urine red blood cells | RBC | 0–3 | /HPF
Urine white blood cells | WBC | 0–5 | /HPF
Pus cell | Pus cell | 0-1 | /HPF
Epith cell | Epith cell | M: 0–3; F: 0–15 | /HPF
Casts | Casts | Not found | /LPF
Ketones | Ketones | (−) | mmol/L
Crystals | Crystals | −∼(±) | /LPF
Bacteria and other | Bacteria and other | − | /HPF

If a patient was female, aged 40–49, infected with monocytes, and had a recent hemoglobin (Hb/Ht) test value that was too low, the frequency of hospitalization was high. If hematocrit (Ht) was abnormal twice in the last three months, average platelet volume (MPV) was abnormal twice, and total protein (TP) was abnormal once, the probability of hospitalization was 93%. If TP, glutamic oxaloacetic transaminase (GOT), and glutamic pyruvic transaminase (GPT) were abnormal twice in the last three months and uric acid was also abnormal, hospitalization risk was 100%.

Huang analyzed the risk of mortality for patients on long-term HD in 2009 [5]. Huang used the Classification and Regression Tree, Mann-Whitney U Test, Chi-square Test, Pearson Correlation, and the Nomogram to analyze 992 patients on long-term HD. Albumin level and age were the factors most strongly related to mortality. Huang clustered and analyzed patients. If a patient had good nutrition and was young, mortality of diabetic patients was 5.45 times that of nondiabetic patients. However, if a patient was malnourished and older, albumin and CRE levels were the factors most strongly related to mortality. Thus, albumin level, age, diabetes status, and CRE level can help predict risk of mortality.

Yeh et al. used a data mining technique to predict hospitalization of HD patients in 2011 [6]. The availability of medical resources and dialysis quality may decline when too many patients are admitted to a hospital. Therefore, Yeh et al. combined the C4.5 decision tree with multiple minimum support (MS) association rule mining. The C4.5 decision tree was used to eliminate null values, and association rule mining was used to identify hospitalization of HD patients. According to the records of hospitalized patients, hospitalized patients seldom have a chronic disease or may not have a chronic disease, but doctors only determine whether a patient should be hospitalized during an examination.

Lin used hospital records of patients combined with association rules and time-series analysis to establish a health-management information system for chronic diseases [7]. Lin found that occluded cerebral arteries may lead to cerebral thrombosis and a cerebral embolism. After examination by a doctor, the rule is effective in avoiding a second stroke. Additionally, ill-defined heart diseases still require improvement. Lin used data mining to help the family members of chronic disease patients and medical staff control the disease.

These scholars usually used well-known blood tests as mining rules. This work uses an effective and novel scheme to identify some previously unknown features to predict HD. The entropy function is applied to identify features that are strongly related to HD, and the k-means clustering algorithm is applied with these key features to group patients.

2.2. Entropy Function. Information gain, proposed by Quinlan in 1979 [8], is a basis of the decision tree constructed by Iterative Dichotomiser 3 (ID3). Information gain can also be utilized to determine differences in feature attributes and other classification attributes. Further, it is usually used to select the split point of ID3.

We assume a classification problem that includes N data records, m feature dimensions, and k clusters. The information gain of a single feature is determined from two related quantities, both called entropy; the difference between them is the information gain of that feature:

$$\mathrm{Entropy}(N) = \sum_{t=1}^{k} P_t \times \log_2 \frac{1}{P_t} = -\sum_{t=1}^{k} P_t \times \log_2 P_t, \tag{2}$$

$$\mathrm{Entropy}(D_j) = \sum_{v=1}^{|D_j|} \frac{|D_{jv}|}{N} \times \mathrm{Entropy}(D_{jv}), \tag{3}$$

$$\mathrm{Gain}(D_j) = \mathrm{Entropy}(N) - \mathrm{Entropy}(D_j). \tag{4}$$

In (2), Entropy(N) is the total information content of the whole problem and serves as the baseline for the information gain of a single feature; $P_t$ is the probability that a record in the N-record dataset belongs to class t. In (3), $D_{jv}$ is the set of records whose j-th feature takes its v-th value, Entropy($D_{jv}$) is the information content of the classification within that set, and the j-th feature dimension has $|D_j|$ values. In (4), Gain($D_j$) is the information gain that the j-th feature dimension contributes to the classification problem. Through (2)–(4), the information gain of each feature for a classification problem is found. This work then evaluates all threshold settings and collects the features with the greatest information gain to form a feature set for classification. Entropy is used to identify key features and cluster HD patients to determine the accuracy of key features.

2.3. Clustering Algorithm. Although many clustering techniques have been proposed, the k-means algorithm is the most representative and widely applied [9]. The k-means algorithm is also called the generalized Lloyd algorithm (GLA) [10]. The k-means algorithm transforms each data record into a data point, and random numbers are used to generate the initial cluster centers. The distance between each data point and each cluster center is then calculated, and a data point is assigned to the cluster center to which it is closest. The recomputed cluster center is the average of all data points in the cluster, and the new cluster center is taken as the basis for the next iteration. This process is repeated until no change occurs. The steps of the k-means algorithm are as follows.

(1) Use random numbers to generate the initial cluster centers $C = \{C_1, C_2, \ldots, C_k\}$.

(2) Calculate the Euclidean distance $d(X, C_i)$ for each data point $X = \{x_1, x_2, \ldots, x_m\}$ and each cluster center $C_i$. The point is classified into the cluster $C_i$ with the shortest distance, and the distance formula is as follows:

$$d(X, C_i) = \sqrt{\sum_{j=1}^{m} \left(x_j - c_{ij}\right)^2}. \tag{5}$$

(3) Recompute the new cluster center $C_i$. If no data point moves to a different cluster, clustering stops; otherwise, steps (2) and (3) are repeated.
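The k-means steps of Section 2.3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `euclidean` and `kmeans`, the `seed` and `max_iter` parameters, the choice of drawing the initial centers from the data points themselves, and the toy data are all hypothetical and were chosen only to make the example self-contained.

```python
import math
import random


def euclidean(x, c):
    """Distance of equation (5) between a data point x and a center c."""
    return math.sqrt(sum((xj - cj) ** 2 for xj, cj in zip(x, c)))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means (Lloyd) iteration; returns (labels, centers)."""
    rng = random.Random(seed)
    # Step (1): random initial centers (assumption: sampled from the data).
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step (2): assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda i: euclidean(p, centers[i]))
            for p in points
        ]
        # Stop when no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step (3): move each center to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old center if a cluster is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy data (hypothetical): two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster receives index 0 depends on the random initialization, so downstream code should compare memberships rather than raw label values.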
2.4. Association Rule. An association rule is a widely used technique. It progressively scans a database to identify rules for the relationships between items. For instance, the rule that people will buy bread after buying milk is written milk → bread (support = 50% and confidence = 100%); support means that the probability of a consumer buying both milk and bread is 50%, and confidence means that the probability of a consumer buying bread after buying milk is 100%.

Agrawal et al. developed the Apriori algorithm in 1994 [11]. The Apriori algorithm is one of the most popular data mining methods, where I is the set of all items, each data record is $X = \{x_1, x_2, \ldots, x_m\}$, and $X \subseteq I$. The expression of the association rule is $x_1 \rightarrow x_2$ (support, confidence), where $x_1 \subseteq I$, $x_2 \subseteq I$, and $x_1 \cap x_2 = \emptyset$. Support and confidence affect mining results most. Support is the fraction of the N data records in which both $x_1$ and $x_2$ occur, $(x_1 \cup x_2)/N$. Confidence is the probability that $x_2$ occurs in a record in which $x_1$ occurs; a rule that satisfies both thresholds is called a strong association rule.

First, set the thresholds of minimum support and minimum confidence to generate frequently occurring items, where $L_b$ represents the frequently occurring b-itemsets, and all generated $L_b$ frequent itemsets are combined to generate candidate itemsets. Only itemsets and rules whose support and confidence values are greater than the minimum support and minimum confidence thresholds are retained. This process is repeated until all $L_b$ frequent itemsets are identified.

3. Proposed Algorithms

This work applies a novel and effective scheme to find key features that predict HD. This work uses the entropy function to find the key features that are strongly related to HD and applies the k-means clustering algorithm with these key features to group patients. Furthermore, the proposed scheme applies the data mining technique to identify association rules from each cluster. These rules can be used to warn patients who may require HD. Figure 1 shows the system architecture, which is divided into four procedures. These procedures are as follows.

(1) The input procedure, which should be handled very carefully, determines the disease target and loads data of various sources and formats into a database. This procedure has a marked impact on the subsequent procedures.

(2) The preprocess procedure is divided into two subprocedures. In the first, quantitative processing, data are converted into an appropriate analytical form; for example, a string form is converted into a numeric form, or a numeric form is converted into equally spaced intervals. In the second, feature selection, this work uses the entropy function to find the key features that are strongly related to diseases.

(3) The mining procedure is also divided into two subprocedures. In the first, clustering analysis, the clustering algorithm is applied to these key features to group patients. In the second, association rule analysis, the Apriori algorithm is applied to find the association rules in each cluster.

(4) The output procedure presents the entire mining result; a medical professional then explains the result and looks for any factor that may cause a disease.

Figure 1: The system architecture. (Input procedure: HIS, LIS, and Excel report data are loaded into a data warehouse. Preprocess procedure: quantification, feature selection with the entropy function, and exclusion of outliers. Mining procedure: filtering by missingNum > minMissing, clustering analysis, and association rule analysis. Output procedure: result.)

3.1. Input Procedure. Examination information comes from many sources, such as a hospital information system (HIS), laboratory information system (LIS), or Excel report. These different systems may have different data storage formats. For example, in the A database, gender is 1 for male and 2 for female, but in the B database, M is for male and F is for female. Thus, an error may occur while collecting data. Therefore, one should apply the preprocess process to ensure that information is correct, complete, and sufficient. The preprocess process is divided into five steps.

(1) Unified data storage format: to simplify mining, all information must be in the same format.

(2) Irrelevant data: if one does not specify the mining topic, mining efficiency and even accuracy will be adversely affected.

(3) Incorrect data: incorrect data may be caused by a source error or login error; thus, such data should be modified or removed.

(4) Formats do not match: to smooth information mining, information must be converted into an appropriate format when necessary.

(5) Incomplete data: incomplete data is a common problem; for example, some information may be lost or lacking for a certain period.

3.2. Preprocess Procedure. Data are standardized to improve analytical accuracy. A standard value may be applied to an item such as triglycerides (TG). If the TG level is ≥201 mg/dL, it exceeds the standard and the standard value is 100; if TG is normal, it is in the range of 20–200 mg/dL and the standard value is 50; if TG is ≤19 mg/dL, it is lower than the standard and the standard value is 0. If data are consecutive, a packing normalization method is used; its formula is as follows:

$$v' = \frac{v - \min_j}{\max_j - \min_j} \times Q, \tag{6}$$

where $v$ represents raw data, $\min_j$ is the minimum value of j, $\max_j$ is the maximum value of j, $v'$ is the packing normalized value, and Q is the quantified distance. Table 4 shows example data after quantization.

Table 4 is a normalized form used to derive information gain and in association rule analysis, and it can effectively differentiate between patients. This work simultaneously uses extreme value normalization; its formula is

$$v' = \frac{v - \min_j}{\max_j - \min_j} \times 100, \tag{7}$$

where $v$ represents raw data, $\min_j$ is the minimum value of j, $\max_j$ is the maximum value of j, and $v'$ is the extreme-value normalized value. For instance, if the WBC value is 6.3, max = 10.7, and min = 3.5, then $v' = [(6.3 - 3.5)/(10.7 - 3.5)] \times 100 = 38.89\%$ can be derived by applying (7).

In the entire database, the maximum and minimum values of each item markedly affect the quantification result, and extreme maximum or minimum values are called outliers. If outliers exist, anomalies will also exist; for example, suppose that Q of CRE is 80 and CRE values are generally 0.37–2.99; a polarized datum occurs when one record is 6990. After quantization, values in the range of 0.37–2.99 will be quantified as 1, and the value recorded as 6990 will be assigned 80. Therefore, this work creates a mechanism to remove outliers. To avoid the influence of outlier values, this work sets a minNum threshold for each record. For example, assume minNum = 3 is the threshold. The total number of hemoglobin (HB) values quantified as 2 (HB = 2) is 9; however, the number quantified as 0 (HB = 0) is 1. This means that most data are assigned to HB = 2, and only 1 datum is assigned to HB = 0. A quantified value whose total count is smaller than minNum is an extreme value. This scheme replaces the extreme value with the average value.

3.3. Information Gain Analysis. This work uses the dialysis item as the classification target when computing information gain. For example, 6 patients are on dialysis (Dialysis = 1) (Table 4), so the occurrence probability is $P_1 = 6/15$ and the information content is $P_1 \times \log_2(1/P_1) = (6/15) \times \log_2(15/6) = 0.528771$. Nine patients are nondialysis (Dialysis = 0), so the occurrence probability is $P_0 = 9/15$ and the information content is $P_0 \times \log_2(1/P_0) = (9/15) \times \log_2(15/9) = 0.442179$. The total information content of $P_0$ and $P_1$, Entropy(N), is 0.970951.
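The Entropy(N) calculation in Section 3.3 can be checked numerically with equations (2)–(4). The sketch below is illustrative only: the function names `entropy` and `information_gain` are assumptions, base-2 logarithms are assumed because they reproduce the printed values, and the two lists are the Dialysis and Sex columns of the 15 records in Table 4.

```python
import math
from collections import Counter


def entropy(labels):
    """Equation (2): sum over classes of P_t * log2(1 / P_t)."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Equations (3) and (4): Entropy(N) minus the weighted entropy
    of the labels within each value of the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Dialysis and Sex columns of Table 4, records 1 to 15.
dialysis = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
sex = [1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1]

total = entropy(dialysis)                   # about 0.97095
gain_sex = information_gain(sex, dialysis)  # about 0.00215
```

The same `information_gain` call can be applied to any other column of Table 4 to rank the features. The gain for Sex is close to zero, which matches its position at the bottom of the ranking.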
Table 4: Packing method normalized data.
ID | Sex | Age | WBC | RBC | HB | BUN | CRE | UA | GOT | GPT | TP | ALB | GLO | A/G | TG | Dialysis
Q | 2 | 5 | 4 | 3 | 3 | 4 | 2 | 2 | 4 | 5 | 2 | 2 | 2 | 2 | 3 | 2
1 | 1 | 3 | 1 | 1 | 1 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1
2 | 0 | 1 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1
3 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1
4 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 0
5 | 1 | 3 | 1 | 1 | 1 | 2 | 1 | 0 | 3 | 4 | 1 | 1 | 1 | 0 | 1 | 0
6 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1
7 | 0 | 4 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0
8 | 1 | 1 | 1 | 1 | 2 | 3 | 1 | 0 | 2 | 4 | 0 | 0 | 0 | 1 | 2 | 1
9 | 1 | 2 | 0 | 1 | 2 | 2 | 1 | 1 | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 0
10 | 0 | 2 | 3 | 1 | 0 | 2 | 0 | 1 | 2 | 1 | 0 | 0 | 1 | 0 | 2 | 0
11 | 1 | 0 | 1 | 2 | 2 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0
12 | 0 | 0 | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0
13 | 0 | 2 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1
14 | 1 | 2 | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0
15 | 1 | 2 | 3 | 1 | 2 | 3 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0

Table 5: Calculation information gain of sex relative to dialysis.
Sex (j) | Dialysis | Count(D_jv) | P_Djv | P_Djv × log(1/P_Djv) | Entropy(D_jv) | Entropy(D_j)
0 | 0 | 4 | 4/7 | 0.46 | 0.99 | 0.459773
0 | 1 | 3 | 3/7 | 0.52 | |
1 | 0 | 5 | 5/8 | 0.42 | 0.95 | 0.509031
1 | 1 | 3 | 3/8 | 0.53 | |
Sum | | | | | | 0.968804

Next, this work calculates the information gain of each item relative to the dialysis item. Take Sex (Table 5) as an example. Seven records are women (Sex = 0), and 4 of them are non-dialysis (Dialysis = 0); the probability of Dialysis = 0 given Sex = 0 is $P_{jv} = 4/7$, and its information content is 0.46. Three records have Sex = 0 and Dialysis = 1; thus, the probability is $P_{jv} = 3/7$, and its information content is 0.52. The sum of 0.46 and 0.52, rounded, is 0.99 (0.985 before rounding). The weighted entropy of the women is 0.985 × (7/15) = 0.459773 because the probability of Sex = 0 is 7/15. After summing the weighted entropy of the women (Sex = 0) and men (Sex = 1), the total is 0.968804, where 0.968804 = 0.459773 + 0.509031. Next, via (4), which is Entropy(N) − Entropy($D_j$), Gain($D_j$) = 0.970951 − 0.968804 = 0.002147.

The information gain of each item related to dialysis can be obtained and ranked, and the association rules can be mined using the top few items as key features. Take Table 6 as an example. Assume that the top three items are chosen. Thus, Age, WBC, and BUN are taken as key features.

Table 6: Information gain of each item.
Items | Sex | Age | WBC | RBC | HB | BUN | CRE | UA | GOT | GPT | TP | ALB | GLO | A/G | TG
Gain | 0.002 | 0.577 | 0.329 | 0.14 | 0.06 | 0.28 | 0.05 | 0.09 | 0.18 | 0.2 | 0.05 | 0.24 | 0.02 | 0.06 | 0.03

3.4. Data Mining Procedure

3.4.1. Missing Values. Some patients may have missing values. If their records are removed directly, some important information may be lost. Thus, this work applies a second filter before data mining analysis. This research sets minMissing as the threshold and takes missingNum as the number of null values in each record. If missingNum > minMissing, the record is removed. Otherwise (missingNum ≤ minMissing), the record is retained and the missing null values are replaced by the mean value. For instance, suppose Age, WBC, and BUN are the top three key features and some records have missing values in them. Assume minMissing is 1. When a record has missingNum > 1, the record is removed; otherwise, the record is retained and the missing null values are replaced by the mean value.

3.4.2. Clustering. This work uses key features for clustering, where $x_1, x_2, \ldots, x_m$ are the m key features, $X = \{x_1, x_2, \ldots, x_m\}$ is a patient record, $x_j$ is a key feature in X, $1 \le j \le m$, and k is the cluster number. The k-means process is as follows.

(1) First, randomly generate k initial cluster centers $C = \{C_1, C_2, \ldots, C_k\}$, where each center is $C_i = \{c_{i1}, c_{i2}, \ldots, c_{im}\}$. Figure 2(a) has ten solid circles, N = 10, which are the locations of each record, and three triangles, k = 3, which are the locations of the cluster centers $C_i$.

(2) Apply (5), $d(X, C_i) = \sqrt{\sum_{j=1}^{m} (x_j - c_{ij})^2}$, to calculate the distance between each patient's data point X and the cluster center $C_i$. When the distance $d_1$ from X to $C_1$ is less than the distance $d_i$ to every other center, X is classified into $C_1$.

(3) Let $C_i = \{X_{c_i 1}, X_{c_i 2}, \ldots, X_{c_i S}\}$ be the membership of a cluster center, where S is the total number of members in $C_i$, and $X_{c_i u}$ is the data point of patient u in $C_i$. Thus, $newC_i$ is obtained by summing the data points in each $C_i$ and dividing by S: $newC_i = \left\{ \sum_{u=1}^{S} x_{u1}/S, \sum_{u=1}^{S} x_{u2}/S, \ldots, \sum_{u=1}^{S} x_{um}/S \right\}$. This value is taken as the new cluster center.

(4) Repeat steps (2) and (3) until each $C_i$ remains the same.

Figure 2: The diagram of clustering algorithm. (a) Initial dataset and cluster center (before). (b) Center displacement (after).

3.4.3. Association Rule. Next, the proposed scheme finds the characteristic rules of each cluster using Apriori association rule analysis. We assume that the total number of records in cluster $C_i$ is S, and each cluster membership is $C_i = \{X_{c_i 1}, X_{c_i 2}, \ldots, X_{c_i S}\}$; thus, the data point $X_{c_i u}$ of patient u is in $C_i$, and the key features $x_{uj}$ are in $X_{c_i u} = \{x_{u1}, x_{u2}, \ldots, x_{um}\}$. Next, the association rule is used to analyze each cluster $C_i$.

(1) First, set the values of minimum support minSup and minimum confidence minConf.

(2) Convert the normalization table into an extreme values table.

(3) Find the candidate set. We assume $\alpha = item_{jp}$, where $item_{jp}$ is the p-th quantified value of the j-th key feature in $C_i$, $1 \le p \le Q$, and Sup($\alpha$) denotes the occurrence probability of $item_{jp}$ in $C_i$. If Sup($\alpha$) ≥ minSup, then $item_{jp}$ becomes a member of the candidate itemset $L_z$; proceed to the next step.

(4) From the candidate set $L_z = \{\alpha_1, \alpha_2, \ldots, \alpha_y\}$, generate the sets of two items, $L_{z+1} = \{\alpha_1 \cup \alpha_2, \alpha_1 \cup \alpha_3, \ldots, \alpha_1 \cup \alpha_y, \alpha_2 \cup \alpha_3, \ldots, \alpha_{y-1} \cup \alpha_y\}$; however, $\alpha_A$ and $\alpha_B$ cannot be the same item. Calculate the occurrence probability of each group, Sup($\alpha_A \cup \alpha_B$). If Sup($\alpha_A \cup \alpha_B$) > minSup, it becomes a member of the frequent itemset $L_{z+1}$.

(5) Take $L_{z+1}$ as a candidate set and repeat step (4) until the candidate set is null.

(6) Generate the association rules of the frequent itemsets. If the confidence of a rule exceeds minConf, the rule is accepted. The process is as follows.

(i) Let $\alpha^*$ be one of the frequent itemsets $L^*$, $\alpha^* = (R_A \cup R_B)$.

(ii) Generate rules $R_A \rightarrow R_B$ and $R_B \rightarrow R_A$.

In the case of a cluster $C_A$, where minSup = 2 and minConf = 0.5, the key features are $item_1$ = Age, $item_2$ = WBC, and $item_3$ = BUN, and S = 7 is the total number of records in $C_A$. This work first finds the frequent itemsets $L_1$ using the minSup and minConf thresholds. The proposed scheme then merges pairs of items from $L_1$ as a candidate set, for example j = Age, p = 3 in $\alpha_1$ and j = WBC, p = 1 in $\alpha_3$, and calculates Sup($\alpha_1 \cup \alpha_3$). If Sup($\alpha_1 \cup \alpha_3$) ≥ minSup, then $\alpha_1 \cup \alpha_3$ becomes a frequent itemset; merging continues until no more frequent itemsets are found.

Next, once all rules are found, the quantified values are converted back into their original values; the formula is

$$v = \frac{v'}{Q} \times \left(\max_j - \min_j\right) + \min_j, \tag{8}$$

where $v'$ is a quantified value, $\min_j$ is the minimum value of j, $\max_j$ is the maximum value of j, $v$ is the original value, and Q is a quantified interval. Take WBC = 1 → Age = 3 as an example rule. If the max of WBC is 10.7, the min is 3.5, and Q is 4, then the original value of WBC = 1 is WBC = 1/4 × (10.7 − 3.5) + 3.5 = 5.3. If the max of Age is 68, the min is 30, and Q is 5, then the original value of Age = 3 is Age = 3/5 × (68 − 30) + 30 = 52.8. Through (8), the association rule WBC = 1 → Age = 3 can be transformed into WBC = 5.3 → Age = 52.8.

4. Experimental Results

This experiment uses health examination records provided by hospitals. The data are mainly for outpatient dialysis and general outpatients. The hospital has 105 records with many values missing. This is because each patient does not undergo all examinations. Therefore, data must first be filtered to eliminate records with missing values. This work adopts BUN and CRE, which are related to kidney function, as the first filter. If any null value occurs in BUN or CRE, the record is removed. In total, 18,166 records are retained after the first filtering.

The purpose of quantification in the preprocess procedure is to convert values into a continuity value or significant difference value from a finite interval. This work sets interval Q for each item based on recommendations by medical staff. Table 7 shows the intervals.

Table 7: Each interval of item.
ID | Item | Interval
1 | TG | 50
2 | AST (GOT) | 20
3 | Ch | 50
4 | ALT (GPT) | 20
5 | UA | 2
6 | K (Blood) | 2
7 | BUN | 5
8 | Amylase (B) | 50
... | ... | ...

4.1. Choose Key Features. The mining result does not make sense when too many items are used. The proposed scheme uses the entropy function to identify the top 4 key features between each item and dialysis; these features are UA, AST (GOT), TG, and K (Blood).

4.2. Mining Procedure

4.2.1. Clustering Analysis. Based upon the above clustering algorithm, this work applies the k-means clustering algorithm with these key features to group patients. Before the experiment, records with many missing values were filtered out, leaving 7118 records. Table 8 shows the cluster grouping result. For example, 1169 patients are classified into the first group. The average indicator values are UA = 6.54, AST (GOT) = 24.48, TG = 119.79, and K (Blood) = 5.10, and the average density of the first group is 13.26. The average difference among all groups is 27.02, which is the best result of 100 random trial runs.

Table 8: Clustering results.
Cluster | UA | AST (GOT) | TG | K (Blood) | Density
Cluster-1 | 6.54 | 24.48 | 119.72 | 5.10 | 14.14
Cluster-2 | 6.16 | 30.12 | 138.92 | 3.92 | 11.59
Cluster-3 | 4.47 | 24.72 | 112.33 | 4.07 | 11.22
Cluster-4 | 8.40 | 28.03 | 228.72 | 4.20 | 20.91

4.2.2. Association Rule Analysis. This work identifies the top four items related to dialysis as TG, AST (GOT), UA, and K (Blood); AST (GOT) is the main indicator of liver function. These four items are adopted as key features, and the association rule technique is applied to analyze the rules of each group after clustering, where minSup = 35% and minConf = 65%. The association rules of the four clusters are shown in Table 9.

4.3. Summary. This work uses the clustering algorithm and the association rule algorithm to identify some previously unknown features of HD patients and possible association rules. This work then evaluates all threshold settings and collects the features with the greatest information gain to form a feature set for classification. Entropy is used to identify key features and cluster HD patients to determine the accuracy of key features. During the clustering process, the clustering algorithm is applied to these key features to group patients, and the entropy function can effectively support clustering analysis with the key features. Furthermore, this work applies the Apriori algorithm to find the association rules of each cluster. Hidden rules for causing any kidney disease can therefore be identified.

This experiment adopts the health examination records provided by one general hospital in Taiwan. The experimental results were discussed with medical staff. From the experimental results, we find that if BUN is in the range of 58.5–61.5 (60 ± 1.5) and Na (Blood) is in the range of 137.5–142.5 (140 ± 2.5), patients have a high risk of receiving dialysis. BUN is reported to be a reliable indicator of high risk, but the role of Na (Blood) is not clearly defined. Therefore, Na (Blood) needs further analysis and clarification. Conversely, if UA is in the range of 6.25–6.75 (6.5 ± 0.25), TG is in the range of 134.75–184.75 (159.75 ± 25), and K (Blood) is in the range of 3.89–4.39 (4.14 ± 0.25), or AC-GLU is in the range of 111–161 (136 ± 25), patients have a low risk of receiving dialysis.

The medical staff state that UA, TG, and AC-GLU definitely affect the possibility that a patient receives dialysis, but the influence of K (Blood) on patients is not clearly defined. This factor should be analyzed further. Finally, there is one more special feature, AST (GOT), because it appears in both the high-risk and low-risk groups. The medical staff state that AST (GOT) is actually not directly related to HD. Thus, AST (GOT) is not a key factor in determining whether a patient requires HD.

Table 9: Association rule of each cluster-k.
α* | Sup(α*) | Conf.
Cluster-1 (k = 1) | |
BUN = 60 ± 1.5 → Dialysis = Yes | 487 | 91%
Dialysis = Yes → AST (GOT) = 24.5 ± 10 | 708 | 74%
AST (GOT) = 24.5 ± 10 → Dialysis = Yes | 523 | 73%
Na (Blood) = 140 ± 2.5 → Dialysis = Yes | 455 | 70%
Dialysis = Yes → BUN = 60 ± 1.5 | 487 | 69%
Na (Blood) = 140 ± 2.5 → AST (GOT) = 24.5 ± 10 | 434 | 66%
Cluster-2 (k = 2) | |
CRE = 0.85 ± 0.15 → Dialysis = No | 487 | 91%
UA = 6.5 ± 0.25, TG = 159.75 ± 25 → Dialysis = No | 1341 | 97%
AC-GLU = 136 ± 25 → Dialysis = No | 1265 | 94%
TG = 159.75 ± 25 → Dialysis = No | 1696 | 93%
UA = 6.5 ± 0.25 → Dialysis = No | 1920 | 93%
AST (GOT) = 45 ± 10 → Dialysis = No | 1479 | 92%
K (Blood) = 4.14 ± 0.25 → Dialysis = No | 1938 | 91%
TG = 159.75 ± 25, Dialysis = No → UA = 6.5 ± 0.25 | 1341 | 79%
TG = 159.75 ± 25 → UA = 6.5 ± 0.25 | 1378 | 76%
TG = 159.75 ± 25 → UA = 6.5 ± 0.25, Dialysis = No | 1341 | 74%
UA = 6.5 ± 0.25, Dialysis = No → TG = 159.75 ± 25 | 1341 | 70%
UA = 6.5 ± 0.25 → TG = 159.75 ± 25 | 1378 | 67%
UA = 6.5 ± 0.25 → TG = 159.75 ± 25, Dialysis = No | 1341 | 65%
CRE = 0.85 ± 0.15 → Dialysis = No | 487 | 91%
UA = 6.5 ± 0.25, TG = 159.75 ± 25 → Dialysis = No | 1341 | 97%
Cluster-3 (k = 3) | |
CRE = 0.85 ± 0.15 → Dialysis = No | 732 | 100%
CRE = 0.85 ± 0.15, K (Blood) = 5 ± 0.25 → Dialysis = No | 560 | 100%
K (Blood) = 4.14 ± 0.25 → Dialysis = No | 910 | 95%
AST (GOT) = 24.5 ± 10, K (Blood) = 4.14 ± 0.25 → Dialysis = No | 507 | 94%
AST (GOT) = 24.5 ± 10 → Dialysis = No | 505 | 92%
AST (GOT) = 24.5 ± 10 → Dialysis = No | 679 | 86%
CRE = 0.85 ± 0.15 → K (Blood) = 4.14 ± 0.25 | 560 | 77%
CRE = 0.85 ± 0.15, Dialysis = No → K (Blood) = 4.14 ± 0.25 | 560 | 77%
CRE = 0.85 ± 0.15 → K (Blood) = 4.14 ± 0.25, Dialysis = No | 560 | 77%
AST (GOT) = 24.5 ± 10, Dialysis = No → K (Blood) = 4.14 ± 0.25 | 507 | 75%
Dialysis = No → K (Blood) = 4.14 ± 0.25 | 910 | 74%
AST (GOT) = 24.5 ± 10 → K (Blood) = 4.14 ± 0.25 | 539 | 68%
Cluster-4 (k = 4) | |
AST (GOT) = 45 ± 10, K (Blood) = 4.14 ± 0.25 → Dialysis = No | 364 | 98%
K (Blood) = 4.14 ± 0.25 → Dialysis = No | 503 | 91%
AST (GOT) = 45 ± 10 → Dialysis = No | 537 | 90%
K (Blood) = 4.14 ± 0.25, Dialysis = No → AST (GOT) = 45 ± 10 | 364 | 72%
Dialysis = No → AST (GOT) = 45 ± 10 | 537 | 71%
AST (GOT) = 45 ± 10, Dialysis = No → K (Blood) = 4.14 ± 0.25 | 364 | 68%
K (Blood) = 4.14 ± 0.25 → AST (GOT) = 45 ± 10 | 372 | 68%
Dialysis = No → K (Blood) = 4.14 ± 0.25 | 503 | 67%
K (Blood) = 4.14 ± 0.25 → AST (GOT) = 45 ± 10, Dialysis = No | 364 | 66%

5. Conclusion

Medical staff try to extract information from patients' health examination records to reduce the occurrence of disease. However, some hidden information may be overlooked because of the limits of human observation or of reference books. Although many data mining techniques have been proposed, most of them focus on known items, and techniques that search for hidden key features are seldom proposed. The reason is that the examination items are numerous but incomplete, which makes it hard to find association rules automatically. This research helps medical staff find previously unknown key features for predicting hemodialysis. We apply the k-means clustering algorithm with these key features to group the patients. Furthermore, the proposed scheme applies a data mining technique to find the association rules of each cluster. The rules can help patients detect the possibility of disease.

Acknowledgment

The authors would like to thank the National Science Council of the Republic of China, Taiwan, for financially supporting this paper under Contract no. NSC 99-2622-E-324-006-CC3.

References

[1] DrKao, "Normal Test Values," 2010, http://www.drkao.com/1st site/health wap/normal main.htm.
[2] Green Cross, "How to Detect Renal Function," 2010, http://www.greencross.org.tw/kidney/symptom sign/kid func.html.
[3] Shin Kong Wu Ho-Su Memorial Hospital, 2010, http://www.skh.org.tw/mnews/178/4-2.htm.
[4] K. C.
Hung, Multiple minimum support association rule mining for hospitalization prediction of hemodialysis patients [M.S. thesis], Computer Science and Information Engineering, 2004.
[5] S. Y. Huang, The evaluation & analysis of the risk of mortality for patients receiving long-term hemodialysis proposal [M.S. thesis], Graduate Institute of Biomedical Informatics, 2009.
[6] J. Y. Yeh, T. H. Wu, and C. W. Tsao, "Using data mining techniques to predict hospitalization of hemodialysis patients," Decision Support Systems, vol. 50, no. 2, pp. 439–448, 2011.
[7] Y. J. Lin, Applying data mining in health management information system for chronic disease [M.S. thesis], Department of Computer Science and Information Management, 2008.
[8] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[9] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, "An efficient k-means clustering algorithm: analysis and implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881–892, 2002.
[10] J. Z. C. Lai, T. J. Huang, and Y. C. Liaw, "A fast k-means clustering algorithm using cluster center displacement," Pattern Recognition, vol. 42, no. 11, pp. 2551–2556, 2009.
[11] R. Agrawal, R. Srikant, H. Mannila et al., "Fast discovery of association rules," in Advances in Knowledge Discovery and Data Mining, pp. 307–328, 1996.

Journal

Advances in Artificial Neural Systems, Hindawi Publishing Corporation

Published: Aug 9, 2012

References