Medical data preprocessing for increased selectivity of diagnosis


Publisher
de Gruyter
Copyright
Copyright © 2016 by the
ISSN
1895-9091
eISSN
1896-530X
DOI
10.1515/bams-2015-0041

Abstract

In this review, we present a framework for increasing the accuracy of computer diagnosis in medical patient checkups. A proposition for medical data analysis that is, to some extent, new has been built on medical data preprocessing. The result of such preprocessing is a transformation of medical data from a descriptive, semantic form into a parameterized mathematical form. A model for mining hidden properties of medical data is presented as well. Exploring the hidden data properties exposed by preprocessing creates new possibilities for medical data interpretation. Diagnosis selectivity has been increased by means of parameterized illness patterns in medical databases.

Keywords: data classification; hidden data properties; medical diagnostics.

*Corresponding author: Andrzej Walczak, Department of Cybernetics (WCY), Military University of Technology (WAT), Warsaw, Poland, E-mail: awalczak@wat.edu.pl
Michal Paczkowski: Department of Cybernetics (WCY), Military University of Technology (WAT), Warsaw, Poland

Introduction and motivation

What are the basic features of data from medical diagnosis? They can be enumerated as follows:
a. The time taken to create a diagnosis is much shorter than the evolution and duration of the illness.
b. The number of symptoms registered during a patient checkup is usually far lower than the number of all symptoms characteristic of the illness.
c. The presence, duration, and exposition of the symptoms depend on the individual patient's reaction, and symptom intensity may be too weak to be observed during a short checkup.
d. Repeating a diagnosis in exactly the same conditions as the previous examination of symptoms is not possible. Each subsequent diagnosis always differs from the previous one and must be treated as a completely new medical investigation. This is caused by the progress of the illness and also by the patient's natural defence against the disease. So, we cannot build statistics of diagnosis results as if diagnosis were a repeatable medical measurement.

The features enumerated above make a medical diagnosis always both incomplete and uncertain. As a result, each diagnosis usually suggests a number of illnesses, and only the physician's experience decides the correctness of the diagnosis.

What can we do to create computer tools for medical diagnosis? Such efforts started around 1972 with the Bayes' net application in de Dombal's mathematical model of diagnosis [1]. Today, computer diagnosis is still not standard in medical treatment, despite the dynamic growth of computer-based medical data processing. Among the computer methods for medical diagnosis, the basic method is the exploration of classifier models [2]. We also propose some classifier applications, but only after preprocessing the medical data from the symptoms' view of the illness to the parameterized view of the illness. Such a transformation allows us to include symptom weighting in an objective, expert-independent manner. Treating all symptoms as equivalent in medical databases sometimes makes the similarity of illnesses very high. Using the parametric model of illness, one can obtain a higher diversity of medical database contents. We show that computer diagnosis then has a higher sensitivity as well.

The literature in the field of medical data processing is huge, so the references could of course be more extensive. The number of publications is, however, too high to analyze fully in this short review. Therefore, only basic citations that match our idea and model are included here.

Background

The question arises of when, if ever, a symptom indicates an illness without any doubt. The only case is when a single symptom is characteristic of one and only one illness. Let us call such a symptom a "diamond of the first kind". In medicine it is also called "pathognomonic".
A different situation arises when a symptom is characteristic of an illness but is also present in a number, say k, of different illnesses. Let us call such a symptom a "diamond of the second kind".

The third situation is when a group of symptoms is always present together within the symptoms of an illness. This is especially true when risk factors coincide with the observed symptoms. We will call such a group of symptoms, taken together with the risk factors, a "diamond of the third kind".

We created a medical database for asthma, skin allergy, and also for skin illnesses. It was part of a medical project within the Innovative Industry Operating Program [3]. This database consists of a descriptive, semantic form of data related as: symptom value - medical procedure for examining this value - illness. All symptom values are equivalent in such a form of data. Such data are in fact only a descriptive illness pattern, and are rather common in medicine. All data in our database are created in accordance with the ICD-10 standard of illness coding.

Let us calculate the level of similarity among such common patterns by means of a set operation, the Jaccard distance $J_d(A,B)$ of formula (1), that is, one minus the ratio of the symptoms shared by two illnesses to all symptoms of the two illnesses, here for skin illness and skin allergy. Some results are shown in Figure 1 to illustrate the presence of very high intrinsic similarity of illnesses. One can see in the figure that most of the illness patterns (ICD-10 codes are applied) are very similar to each other. So, if the patient's symptoms belong to a subset of illnesses with high intrinsic similarity, then the diagnosis quality is bound to be poor.

Obvious questions arise: Is it possible to obtain an illness pattern in a form that avoids such large intrinsic illness similarities? How can we get more diversity in our database and also, as a result, a more sensitive diagnosis? Can we determine illnesses using the symptoms described above as "diamonds"?
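The Jaccard distance between two descriptive illness patterns can be sketched as follows. This is only an illustration under stated assumptions: the function name `jaccard_distance` and the symptom sets are hypothetical and are not taken from the authors' database, and the treatment of two empty patterns is a choice made here.

```python
def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B| between two symptom sets."""
    union = a | b
    if not union:
        # Assumption: two empty patterns are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical illness patterns in which every symptom counts equally.
illness_a = {"itching", "scaling", "erythema", "dry skin"}
illness_b = {"itching", "scaling", "erythema", "papules"}
illness_c = {"wheezing", "cough", "itching"}

# Three of five distinct symptoms are shared, so the patterns are close.
print(jaccard_distance(illness_a, illness_b))
# Only one of six distinct symptoms is shared, so the patterns are far apart.
print(jaccard_distance(illness_a, illness_c))
```

A small distance between two patterns means that a checkup matching the shared symptoms cannot separate the two illnesses, which is the problem the questions above address.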
Methods and models

Let us assume that each single symptom can be modeled as a shape, say a square, characterized by the following numbers: the area of this square, an assumed weight, and the circumference. So, each symptom can be described by those three numbers. Say all equivalent symptoms have those numbers equal to 1.

If the symptom is a "diamond of the first kind", its weight, area, and circumference are corrected to match those of all the remaining symptoms of the considered illness taken together, because of the significance of such a "diamond" in making the diagnosis. When the illness has l symptoms in the database, such a single "diamond" therefore has a weight, area, and circumference (l - 1) times larger than any other symptom of the considered illness. When the measured symptom is a "diamond of the second kind", its weight, area (field), and circumference are (l - 1)/k times larger than those of the other symptoms of the illness, where k is the number of illnesses in which the examined symptom can be observed. When the patient's illness is manifested as a "diamond of the third kind", the whole group of symptoms expands its weight, area, and circumference correspondingly over the rest of the symptoms of the examined illness.

Finally, one can express the symptoms of an illness using only numbers, which can be taken as vector components in a three-dimensional space. This space is spanned by the weight, area, and circumference of the symptoms. The presented "geometrical" description of our reasoning conveys the core of the model. We also obtain a method for symptom weighting that is objective and independent of an expert's point of view. Our symptom weighting arises strictly from knowledge about diseases. All medical databases have been transformed from the semantic form into vector space as described above.
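The weighting rules for the first two kinds of "diamonds" can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the name `diamond_factor` and its parameters are hypothetical, whether a symptom counts as a "diamond" is passed in by the caller because the text gives no automatic criterion, and the "diamond of the third kind" is left out because its scaling is not specified numerically.

```python
def diamond_factor(n_symptoms: int, n_illnesses: int, is_diamond: bool) -> float:
    """Scaling factor applied to a symptom's baseline weight, area, and circumference.

    n_symptoms  -- l, the number of symptoms of the illness in the database
    n_illnesses -- k, the number of illnesses in which the symptom occurs
    is_diamond  -- caller-supplied flag (assumption: not derived automatically)
    """
    if not is_diamond:
        return 1.0  # equivalent symptoms keep the baseline value 1
    if n_symptoms < 2 or n_illnesses < 1:
        raise ValueError("need l >= 2 and k >= 1 for a diamond symptom")
    # First kind: k == 1 gives (l - 1); second kind: (l - 1) / k.
    return (n_symptoms - 1) / n_illnesses


# Hypothetical illness with l = 6 symptoms.
print(diamond_factor(6, 1, True))   # diamond of the first kind
print(diamond_factor(6, 2, True))   # diamond of the second kind, k = 2
print(diamond_factor(6, 3, False))  # ordinary symptom
```

Because the first kind is the special case k = 1, a single expression covers both kinds; the factor depends only on the structure of the database, not on an expert's judgment.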
Now, we have two equivalent forms of the medical database. One describes illnesses only by means of symptoms, and the other by means of numbers. In the second database, the intrinsic similarity of illnesses is much lower than in the first one. Moreover, illnesses with "diamonds" stand out from the rest of the database content.

The Jaccard distance used to measure the similarity of illness patterns is

$$J_d(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} \qquad (1)$$

In formula (1), A and B are the sets of symptoms for illness A and illness B, respectively. For illustration purposes, the similarity of illnesses was calculated for part of the created database (Figure 1).

Illness code   B36    B85    C84.0  H26.8  L13.0  L20.0
B36            1.00   0.91   0.78   0.72   0.12   0.13
B85            0.91   1.00   0.85   0.77   0.14   0.14
C84.0          0.78   0.85   1.00   0.90   0.12   0.13
H26.8          0.72   0.77   0.90   1.00   0.12   0.12
L13.0          0.12   0.14   0.12   0.12   1.00   0.61
L20.0          0.13   0.14   0.13   0.12   0.61   1.00

Figure 1: Example of Jaccard coefficient between illnesses chosen from part of the created database. (The diagonal of 1.00 shows that the values are similarity coefficients, $1 - J_d$.)

Medical diagnosis in the parameterized illness model

Let X be the space of data x from the patient examination. Let us also describe illness L as a set of symptoms $l_i$:

$$L = \{l_1, \ldots, l_m\} \qquad (2)$$

In the above formula, m denotes the number of all recognized and described symptoms of illness L. Each instance $x \in X$ is related to all possible $L \in 2^L$. Each medical diagnosis creates pairs $(x, L_x) \in X \times 2^L$. The item $L_x$ indicates, for each illness, which symptoms match the symptoms present in the patient checkup. During the presented transformation T from semantic representation into vector space, each symptom transforms into

$$T: l_i \to \{w_i(m,k),\ f_i(m,k),\ c_i(m,k)\} \qquad (3)$$

where $w_i$, $f_i$, and $c_i$ are the symptom's weight, area (field), and perimeter, respectively, calculated in accordance with the idea of "diamonds". The values of weight, area, and perimeter for each single symptom i always depend on the values of m and k in accordance with the concept of "diamonds" and the resulting rules for the calculation of w, f, and c. So, those three numbers are the mathematical model of each symptom. Correspondingly, for each diagnosis,

$$T: x_i \to \{w_p(m(L),k(L)),\ f_p(m(L),k(L)),\ c_p(m(L),k(L))\} \qquad (4)$$

The components $w_p$, $f_p$, and $c_p$ are calculated for the patient checkup in accordance with the core idea of "diamonds". Here, the most characteristic feature of the proposed model is seen. The vector component of a patient checkup differs from one illness to another even if it describes the same symptom, because $w_p$, $f_p$, and $c_p$ depend directly on the data structure of illness L. We can say that the patient's data measured during a checkup create a cloud of points, not a single point, in the created vector space (see formula 4). This is because the vector components $w_p$, $f_p$, and $c_p$ depend on the number of symptoms m in each possible illness $L_x$ that matches the symptoms obtained during patient checkup x, and also on k, in accordance with the concept of "diamonds". When a "diamond of the first kind" is recognized, k is of course equal to 1.

Now, illness L is a vector of length

$$|L| = \sqrt{w_i^2(m,k) + f_i^2(m,k) + c_i^2(m,k)} \qquad (5)$$

The corresponding vector constructed as a result of the current diagnosis is of the form

$$|x| = \sqrt{w_p^2(m(L_x),k(L_x)) + f_p^2(m(L_x),k(L_x)) + c_p^2(m(L_x),k(L_x))} \qquad (5a)$$

And the final diagnosis is the outcome of the shortest Euclidean distance taken over the whole transformed database for each patient instance:

$$d(L_x, x) = \lVert L_x - x \rVert \qquad (6)$$

where $L_x$ denotes the subset of illnesses of L in the sense of the set product $L \cap X \neq \emptyset$, and the instance x and the illness subset $L_x$ are transformed into the vector space of parameterized illnesses. At the same time, $L_x$ drops part of the "cloud of points" generated in formula (5a), in a manner that fits the symptoms inside L and x.

Results

The example of illness space is shown in Figure 2 for skin allergies.
One can see that the diversification of illnesses is higher than in the common database. Examples of diagnoses are shown in Figures 3 and 4.

Figure 2: The example of vector space for illnesses from the considered database. (3D plot; axes w, f, c.)

Two very similar illnesses have been examined as an example of diagnosis after a patient checkup. The parameterized vector of illness is equal to [wi, fi, ci] = [894, 894, 228], and the patient checkup produces vectors equal to [wp, fp, cp] = [641, 641, 184] when compared against ichthyosis vulgaris (fish skin), and [wp, fp, cp] = [252, 252, 140] when compared against pityriasis rosea. The Euclidean distances are d(Lx, x) = 359.7 and d(Lx, x) = 901.8, respectively. The final diagnosis therefore points to ichthyosis vulgaris. It remains in accordance with the physicians' diagnosis. We must underline that "diamonds of the first kind" are absent for those two illnesses.

Figure 3: Pityriasis rosea (ICD-10 L42) diagnosed in parameterized illness space. (3D plot; axes w, f, c.)

Figure 4: Ichthyosis vulgaris (ICD-10 Q80.0) diagnosis in parameterized illness space. (3D plot; axes w, f, c.)

One can see that the concept of "diamonds" causes the symptoms registered during a patient checkup to be measured differently in vector space for each considered illness. The values of [wp, fp, cp] are always measured relative to the vector of each illness in the database. The smallest distance between the patient's vector and an illness vector indicates the nearest illness with respect to the symptoms measured during the patient's checkup. So, the same checkup produces a cloud of vectors for patient diagnosis.

Conclusions

Illness pattern parameterization transforms medical databases into more diverse forms.
The transformation goes from symptoms expressed in semantic form to parametric models of symptoms described by sets of numbers. It is built on the "diamonds" concept, which realizes the weighting of symptoms. Such weighted symptoms allow us to obtain higher diversification between illnesses, and higher diagnosis selectivity as well.

Medical diagnosis consists of a cascade of two classifiers. The first classifier is always taken over all analyzed databases as the product, in symptom space, between patient symptoms and illness symptoms. This product filters from all databases only those parts that contain symptoms observed during the patient checkup. This operation creates the Lx subset. The second classifier in the cascade is taken in the vector space of parameters as the Euclidean distance measured between the patient vector and the individual vectors of the Lx subspace.

The created algorithm is of relatively low complexity, and the selectivity of diagnosis calculated in the parameterized space is higher than that calculated in the space of equivalent symptoms alone. The concept of "diamonds" looks for hidden properties of medical data. To the authors' knowledge, such a concept for patient diagnosis is to some extent new among the classification methods applied so far in medicine. Because of its simplicity and relatively low complexity, we decided to present it.

We should say that parameterization via geometrical models is not actually new. There exists a so-called "spider net" model, which seems to be the nearest to our solution described above. Similar efforts have been made earlier [4], but that work only proposes the possibility of obtaining numbers in place of descriptive forms of symptoms. A method for pulling out the hidden data of symptoms has not been proposed there.
Also, the weighting of patient symptoms has not been proposed in the form contained in our model.

The proposed method also produces an all-digital model of illnesses and patient symptoms. The number of values taken into account when creating a diagnosis is higher than in common models of semantic data for symptoms. A new feature is that the symptoms inside each illness are weighted, which indicates how important each symptom is. The calculated weights depend only on the data structure and are independent of any single, non-objective expert opinion.

Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: None declared.
Employment or leadership: None declared.
Honorarium: None declared.
Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.
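As an illustration of the two-classifier cascade summarized in the Conclusions, the sketch below filters illnesses by shared symptoms and then selects the nearest illness by Euclidean distance. It is a sketch under stated assumptions rather than the authors' code: the function names and the symptom database are hypothetical, and the same illness vector [894, 894, 228] is assumed for both comparisons because the text gives only one. The patient vectors are the rounded values reported in the Results.

```python
import math

Vector = tuple[float, float, float]  # (w, f, c)


def candidate_illnesses(patient: set[str], database: dict[str, set[str]]) -> dict[str, set[str]]:
    """Stage 1: keep only illnesses sharing at least one symptom with the checkup."""
    return {name: symptoms & patient
            for name, symptoms in database.items()
            if symptoms & patient}


def nearest_illness(illness_vectors: dict[str, Vector],
                    patient_vectors: dict[str, Vector]) -> tuple[str, dict[str, float]]:
    """Stage 2: compare each illness with the patient vector computed for it."""
    distances = {name: math.dist(illness_vectors[name], patient_vectors[name])
                 for name in patient_vectors}
    return min(distances, key=distances.get), distances


# Stage 1 on a hypothetical symptom database.
database = {"illness X": {"scaling", "dry skin"}, "illness Y": {"cough"}}
print(candidate_illnesses({"scaling"}, database))

# Stage 2 on the rounded vectors from the Results section.
illness_vectors = {"ichthyosis vulgaris": (894, 894, 228),
                   "pityriasis rosea": (894, 894, 228)}
patient_vectors = {"ichthyosis vulgaris": (641, 641, 184),
                   "pityriasis rosea": (252, 252, 140)}
best, distances = nearest_illness(illness_vectors, patient_vectors)
print(best, distances)
```

Recomputing from the rounded components gives distances of roughly 360.5 and 912.2, close to the reported 359.7 and 901.8; the small difference may come from rounding of the published vectors, and the selected illness is the same.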

Journal

Bio-Algorithms and Med-Systems, de Gruyter

Published: Mar 1, 2016

References