Understanding Gender Biases and Differences in Web-Based Reviews of Sanctioned Physicians Through a Machine Learning Approach: Mixed Methods Study

Understanding Gender Biases and Differences in Web-Based Reviews of Sanctioned Physicians Through... Background: Previous studies have highlighted gender differences in web-based physician reviews; however, so far, no study has linked web-based ratings with quality of care. Objective: We compared a consumer-generated measure of physician quality (web-based ratings) with a clinical quality outcome (sanctions for malpractice or improper behavior) to understand how patients’ perceptions and evaluations of physicians differ based on the physician’s gender. Methods: We used data from a large web-based physician review website and the Federation of State Medical Boards. We implemented paragraph vector methods to identify words that are specific to and indicative of separate groups of physicians. Then, we enriched these findings by using the National Research Council Canada word-emotion association lexicon to assign emotional scores to reviews for different subpopulations according to gender, gender and sanction, and gender and rating. Results: We found statistically significant differences in the sentiment and emotion of reviews between male and female physicians. Numerical ratings are lower and sentiment in text reviews is more negative for women who will be sanctioned than for men who will be sanctioned; sanctioned male physicians are still associated with positive reviews. Conclusions: Given the growing impact of web-based reviews on demand for physician services, understanding the different dynamics of reviews for male and female physicians is important for consumers and platform architects who may revisit their platform design. (JMIR Form Res 2022;6(9):e34902) doi: 10.2196/34902 KEYWORDS gender; natural language processing; web-based reviews; physician ratings by customer; text mining to help them decide which physician to see [1]. Although ratings Introduction are widely used, they are also sparse, with many physicians having only 1 or 2 ratings on any given site. 
In addition, up to Background 90% of all web-based reviews are positive [2]. Positive reviews Web-based reviews of physicians play an important role in have been estimated to increase physician demand by as much patients’ searches for providers. A 2017 National Institutes of as 7% [3]. Thus, to fully understand the impact of web-based Health survey found that 39% of adults used web-based reviews https://formative.jmir.org/2022/9/e34902 JMIR Form Res 2022 | vol. 6 | iss. 9 | e34902 | p. 1 (page number not for citation purposes) XSL FO RenderX JMIR FORMATIVE RESEARCH Barnett et al reviews on individual providers and the health care system as attacks treated by male physicians had significantly higher a whole, it is important to identify when and why physicians mortality than female patients treated by female physicians. are given negative reviews and whether those negative reviews Web-based physician reviews have been studied from diverse actually reflect reality. angles. We divided our summary of the literature into review There are many reasons to believe that web-based reviews may content, how reviews correlate with peer ratings, fraudulent be unhelpful. For example, they are subject to review fraud, as reviews, sentiment analysis, and finally, the impact of physician Hu et al [4] estimated that 10% of all web-based reviews are reviews. fraudulent. In addition, they are typically nonexpert reviews of Content of Physician Reviews expert services; that is, credence services, and customer reviews Hao and Zhang [12] implemented latent Dirichlet allocation of credence services are generally thought to be unhelpful [5]. 
(LDA) as a topic modeling technique for textual review of data For both reasons, Dr Peter Carmel, president of the American about Chinese physicians in 4 major specialty areas and Medical Association, has argued that “anonymous online identified popular review topics, including professionalism and opinions of physicians should be taken with grain of salt and showing appreciation for physicians’ detailed symptom should not be a patient’s sole source of information when descriptions [12]. A second study identified patient satisfaction, looking for a new physician” [6]. Studies have further found staff, and access as important themes in the reviews studied that although web-based reviews matched peer evaluations of [13]. physicians in outpatient specialties, this was not the case for inpatient surgical specialties [7]. An extensive analysis of reviews from US health care review websites by Thawani et al [14] found that female physicians Gender adds another layer of complexity to questions related receive lower ratings overall, even after accounting for specialty. to the influence, validity, and value of web-based reviews. Comments about female physicians are more likely to be related Studies have shown that female physicians are given more to their interpersonal skills, whereas for male physicians, negative reviews and that women are rated as less amicable [8]. comments focus more on professionalism and helpfulness. Although previous studies have shown differences in review Marrero et al [15] further examined a subset of the same data content based on physician gender, our study adds a critical to understand the influence of gender on how patients both dimension by considering an external indicator of physician perceive and evaluate their surgeons, confirming that women quality. 
We used data from the Federation of State Medical are evaluated more positively for social interactions and men Boards on physicians who have been sanctioned by their state for technical aptitude. medical boards for unsuitability to practice medicine, either for negligence, malpractice, or other improper behavior. Sanctions Validity of Physician Reviews range from probation to complete revocation of the offending McGrath et al [7] examined the validity of patient-generated physician’s medical license. As receiving a sanction is an web-based physician reviews and found that validity is affected objective marker of low-quality medical care, at least for some by physician specialty. For specialties such as family medicine, physicians, looking at sanctions gives us a way to quantify allergies, internal medicine, and pediatrics, the web-based physician quality, which is a notoriously difficult task. We ratings of physicians listed as a top doctor by their peers are showed that women receive systematically different reviews significantly higher than the ratings of those without this from men and that female physicians who will be sanctioned peer-generated quality indicator. Kordzadeh [16] showed that in the future are rated lower and receive more negative the ratings listed on hospital websites are systematically higher comments in their reviews than similarly situated male than those on outside commercial physician rating sites such as physicians. RateMDs and Google Reviews. Related Literature Sentiment of Physician Reviews Physician’s Gender Wallace et al [17] developed a factorial LDA model to jointly Previous studies have explored gender differences in light of identify both sentiment and topics from reviews. By how physicians consult and communicate with their patients. 
incorporating the factorial LDA output into regression analysis, Studies have concluded that female physicians are generally they further found that positive sentiment is associated with more communicative and interpersonal than male physicians health care measurements such as patients’ revisit probability as they focus more on building partnership, asking questions, and health care costs and that the model can explain more and providing information, which results in long medical variance than models using only rating information. Similarly, appointments with female physicians [9]. This long consultation Rivas et al [18] developed a dependency tree–based classifier duration reduces the volume of consultations that female to capture patterns from each review, which can be used to sort physicians can provide [10]. Some studies have explored the physician reviews into a 2D classification system based on topic reasons for long consultations and the role of gender in medical and polarity. Waltena et al [19] focused on the impact of decision-making. For instance, when diagnosing coronary heart sentiment on topic extraction in hospital reviews, and by adding disease, female physicians are more engaged with the historical 2 topics representing positive and negative sentiment in latent presentation of the patient’s condition and more likely to be semantic analysis, the authors successfully reduced the bias affected by the patient’s gender than male physicians. owing to sentiment on the subjects of topics. Greenwood et al [11] showed that female patients who had heart https://formative.jmir.org/2022/9/e34902 JMIR Form Res 2022 | vol. 6 | iss. 9 | e34902 | p. 2 (page number not for citation purposes) XSL FO RenderX JMIR FORMATIVE RESEARCH Barnett et al sanctioned physicians from the Federation of State Medical Impact of Physician Reviews Boards. 
We removed any reviews of physicians that were made The impact of web-based physician reviews on patient choice after they were sanctioned, so any official sanction does not remains an active area of research. Xu et al [3] explored the affect the content of the reviews. In total, we obtained 403,470 interaction between web-based physician reviews and physician reviews of 134,973 physicians across the United States. In our demand and concluded that the number of reviews and data, men were more than twice as likely as women to be disclosure of reviewer identity are positively related to physician sanctioned; 1.7% (1629/95,831) of all male physicians were demand but negatively correlated with review length. Through sanctioned, whereas only 0.64% (250/39,142) of female a counterfactual experiment, they found that strategies for physicians were sanctioned. improving ratings (eg, disclosing reviewers’ identities and limiting review length) can increase the demand for a physician The web-based reviews from RateMDs were merged with the by as much as 7.24%. However, improving the operational state medical board sanction data by matching physician name process or platform design can increase physician demand even (including matching using a dictionary of common nicknames further. Li et al [20] studied how web-based reviews and [eg, Kate for Katherine]), state, specialty, medical school, and physicians’ gender affect patients’ primary care physician graduation year (where available). The physicians in the sanction choices. The results indicated that among physicians whose data who we could not perfectly match owing to multiple skills are endorsed in reviews, if a female physician is endorsed matches or no matches (and which amounted to <5% of the for their interpersonal characteristics, such as compassion and sample) were excluded from the study. personableness, they are more likely to be chosen than a male Methodology physician endorsed for the same reasons. 
However, this kind of gender effect is not observed among physicians endorsed for Overview their technical skills. Bedside manner, diagnosis accuracy, The field of text mining and natural language processing is patients’ waiting time, and consultation length are critical in growing rapidly, with many emerging techniques available to patients’ choice of a physician [3]. analyze text and discover patterns in documents via automated Our study is the first to analyze the content of patient reviews procedures. In their book, Foundations of Statistical Natural of physicians across genders using natural language processing Language Processing, Manning and Schutze [22] stated that tools that accounts for differences in ratings and sanction status. the availability of large text corpora has changed the scientific This allowed us to understand both the set of criteria on which approach to language in linguistics and cognitive science. male and female physicians are evaluated and the impact of Therefore, phenomena that were previously undetectable or poor performance (as measured by sanctions). We further seemingly uninteresting have become the central focus of lexical applied an emotional index to understand, in a multidimensional analysis. Taking advantage of some of these new developments, way, the tones of the different types of reviews based on ratings, in this study, we implemented paragraph vector (as described gender, and sanction status. More specifically, in this study, we in the following sections) and used a word-emotion association aimed to determine whether reviews of women systematically lexicon on the corpus of physician reviews to analyze the data differ from those of men. In particular, we aimed to discover in a nuanced manner. 
whether female physicians are rated lower at baseline than male Data Preprocessing physicians and whether female physicians experience larger To make the raw data analyzable, we performed a series of reputational penalties than male physicians for low-quality tasks. First, the reviews were converted to lowercase, so that services (as indicated by sanctions from the state medical board). capital letters are treated the same as lowercase letters. Second, punctuation was removed because it typically adds unnecessary Methods noise to word models. Third, stopwords, defined as unimportant Data words that are overly common (eg, “the,” “and,” and “is”) were removed using a freely available System for the Mechanical Our data were collected from 2 sources: physician reviews were Analysis and Retrieval of Text stopword list built by Salton and obtained from RateMDs and combined with physician sanction Buckley and sourced from web-based Appendix 11 of the paper data from the Federation of State Medical Boards [21]. by Lewis et al [23]. Fourth, we removed numbers because, The data from RateMDs include physicians’ average ratings on similar to punctuation, they add noise to the analysis. On the a 1- to 5-star scale. Reviewers rate the overall experience and remaining words in the corpus, we performed stemming using 4 other defined categories: helpfulness, knowledgeability, the Porter stemming algorithm [24,25]. Stemming is the act of punctuality, and staff. The data further contain the text of the reducing words to their root form (eg, “practice,” “practicing,” reviews. and “practiced” become “practic”). This allows models to treat these words as one concept rather than as separate ideas. As we State licensing boards issue sanctions to physicians for issues had a limited-sized data set, we applied all the preprocessing related to their suitability to practice medicine in each state. 
steps mentioned previously to maximize insights from a concise Reasons for sanctions include, but are not limited to, serious vocabulary. Although the removal of stopwords resulted in malpractice, performing unnecessary treatment, fraudulent some locally unnatural word sequences (such as articles not billing, and abuse of patients. We collected every review posted appearing before nouns), we found that this did not hinder our between October 2004 and August 2011 and matched it by analysis. name, location, state, and specialty with the database of https://formative.jmir.org/2022/9/e34902 JMIR Form Res 2022 | vol. 6 | iss. 9 | e34902 | p. 3 (page number not for citation purposes) XSL FO RenderX JMIR FORMATIVE RESEARCH Barnett et al differences between the subsets. For example, we computed the Analytical Techniques similarity of wait to the female corpus of reviews, computed In this study, we applied a paragraph vector framework [26], a the similarity of wait to the male corpus of reviews, and natural language processing method that represents each word calculated the difference in these scores. Our analysis focused or document as a dense vector (ie, a location in space), on the words with the highest absolute difference between the called an embedding, which is then used as an input to train a similarity scores for one subset of reviews (typically female) model to predict co-occurrence of words. We used the paragraph and the complementary subset (typically male). vector distributed bag-of-words model, which uses words from To understand the emotional nature of the reviews, we used the a given width window to predict the next word in the document. 
NRC word-emotion association lexicon [27] to attribute In this framework, “kind” is located closer to “nice” than sentiment and emotional scores to the corpus (NRC stands for “surgery” because “nice” has a much higher probability than the National Research Council Canada, but the lexicon is “surgery” of being found in similar contexts as “kind.” We used commonly referred to as the NRC emotion lexicon). This lexicon the paragraph vector model to generate an embedding of words, created an afinn dictionary by rating words on a scale of 8 which can be used to calculate the similarity (via cosine emotions: anger, anticipation, disgust, fear, joy, sadness, similarity) between any set of words or documents. Henceforth, surprise, and trust. Using the scores from this lexicon, we were we refer to the cosine similarity between words or documents able to both rate reviews on an aggregate emotional scale (how as the similarity score. emotional the document is as a whole) and rank them for each For each data slice (eg, sanctioned physicians), we trained a of the 8 emotions. More specifically, for each data cut (eg, paragraph vector model. Once the model was trained, we could sanctioned female physicians), each word in each physician’s use the embedding to identify words associated with the medical review was scored based on the emotional score of the word, reviews of different types of physicians (eg, based on gender). and then, average physician score was derived by averaging all To compare specific differences across a physician population, physicians’ emotional scores. Understanding these emotional we concatenated every review from one specific subset of data scores allowed us to develop a deep understanding of the criteria (eg, sanctioned male physicians) and found the similarity scores that patients use to evaluate female and male physicians and of this document with each word within the corpus. Then, we how those criteria differ. 
repeated this process for the complementary subset (eg, We have summarized the methodological approach in Figure sanctioned female physicians) and compared the similarity scores for each subset. We extracted the words with the largest Figure 1. Analysis flowchart. NRC: National Research Council Canada; OBGYN: obstetrics and gynecology. https://formative.jmir.org/2022/9/e34902 JMIR Form Res 2022 | vol. 6 | iss. 9 | e34902 | p. 4 (page number not for citation purposes) XSL FO RenderX JMIR FORMATIVE RESEARCH Barnett et al have listed the exact parameter settings in Multimedia Appendix Implementation and Hypertuning 1 and summarized the results of the paragraph vector model for We used doc2vec [28], a Python package implementing “knowledgeable,” “wonderful,” “caring,” and “rude” in Figure paragraph vectors, to learn how patients review physicians S2 in Multimedia Appendix 1. differently across gender, sanctions, and ratings (both in isolation and interaction). We performed a standard doc2vec Ethical Considerations implementation to learn the paragraph vectors of the following: All the data used in this study are publicly available and do not Gender (male and female) contain identifiable private information about individuals. Thus, Composite label of gender and sanction (female sanctioned, this study was not deemed to require institutional review board female unsanctioned, male sanctioned, and male review. After merging sanction data with review data by name, unsanctioned) specific physician identities were removed from the data set Composite label of gender and rating (female high rating, and not used in the analyses. female low to medium rating, male high rating, and male low to medium rating), where we defined high rating as ≥4 Results stars and low to medium rating as <4 stars Data Overview To overcome majority bias, we sampled an equal number of Figure 2 shows the number of physicians in each specialty by reviews for each group. 
We trained the models independently gender. Internal medicine and family practice are the 2 most for the different metadata cuts, rather than treating each separate common specialties in our data. The figure highlights that there review as an individual document. By fitting the different groups are more male physicians than female physicians in every separately, we were able to understand the specific lexicons specialty; overall, 29% (39,142/134,973) of the physicians in associated with each metadata cut (gender, sanction, and rating). the sample are women. This gender imbalance is easily Then, we analyzed the similarity scores of words to their noticeable in the more common disciplines; internal medicine respective corpora and compared the scores. has the highest number of female physicians, but there are still We pretrained the paragraph vector framework, using the twice as many male physicians. The imbalance is even more continuous bag-of-words algorithm to tune the hyperparameters, prominent in some small disciplines such as orthopedic surgery by testing the most similar words to several words such as and neurological surgery, where men outnumber women 23:1 “knowledgeable,” “wonderful,” “caring,” and “rude.” We ran and 15:1, respectively. Obstetrics and gynecology (OBGYN) multiple variations of the model to identify the best settings. and pediatrics departments are more balanced in terms of gender, The results were consistent across different parameters, which with an approximately even ratio of men to women. gave us confidence in the robustness of the final model. We Figure 2. Number of physicians in each specialty, broken down by gender. Table 1 highlights the average star ratings overall and for each When the ratings of sanctioned male physicians are compared of the main categories present in the reviews (helpfulness, with those of sanctioned female physicians, the absolute knowledgeability, punctuality, and staff). 
This pattern is differences are of similar magnitude; however, owing to the consistent across specialties including internal medicine and small size of the sanctioned population, the differences are not OBGYN. Furthermore, unsanctioned physicians receive higher statistically significant. Among sanctioned physicians, female ratings than sanctioned physicians. We note that the staff physicians receive lower ratings (by an average of approximately category ratings may not be reflective of the physician’s medical 0.1 stars) than male physicians (not considering specialties). capabilities. In all cases, the average rating for men is higher The difference between genders among sanctioned physicians than that for women. These differences are statistically is greater than that among unsanctioned physicians, especially significant (P<.001, evaluated with 2-tailed t tests) both when for those rated around average for helpfulness and comparing unsanctioned male physicians with unsanctioned knowledgeability. A detailed breakdown of the number of female physicians and when comparing unsanctioned female sanctioned physicians is provided in Table S1 in Multimedia or male physicians with sanctioned female or male physicians. Appendix 1. https://formative.jmir.org/2022/9/e34902 JMIR Form Res 2022 | vol. 6 | iss. 9 | e34902 | p. 5 (page number not for citation purposes) XSL FO RenderX JMIR FORMATIVE RESEARCH Barnett et al Table 1. Average star rating (out of 5 stars) overall and for the 4 RateMDs score categories for the whole sample of physicians. Ratings are separated by gender and sanction status. 
Categories Full sample (n=134,973), Female physicians (n=39,142, 29%), mean (SD) Male physicians (n=95,831, 71%), mean (SD) mean (SD) Unsanctioned Sanctioned (n=250, Unsanctioned Sanctioned (n=1629, (n=38,892, 99.36%) 0.64%) (n=94,202, 98.30%) 1.69%) Overall 3.86 (1.12) 3.81 (1.12) 3.41 (1.21) 3.89 (1.12) 3.52 (1.24) Helpfulness 3.89 (1.36) 3.85 (1.35) 3.47 (1.44) 3.90 (1.36) 3.59 (1.48) Knowledgeability 4.03 (1.25) 3.99 (1.24) 3.62 (1.35) 4.06 (1.25) 3.75 (1.38) Punctuality 3.83 (1.18) 3.77 (1.18) 3.39 (1.27) 3.86 (1.18) 3.43 (1.31) Staff 3.67 (1.3) 3.61 (1.3) 3.02 (1.41) 3.70 (1.3) 3.19 (1.45) On average, each physician receives 3 reviews, with an average However, on the whole and across specialties, the group with length of 55.7 (SD 47.65) words. In general, lower-ranked the longest reviews are the low to medium–ranked women. physicians receive longer reviews than higher-ranked physicians; Given what we know from our subsequent content analysis, this people have more to say about an experience they are is because patients have longer and more negative comments dissatisfied with. On average, women receive longer reviews to make about women, whereas reviews of male physicians are than men. An exception is in the OBGYN field—in this short and more positive. As highlighted in Table 2, these trends specialty, patients have more to say about a sanctioned male hold when we break down the length analysis by specialty. physician than they do about a sanctioned female physician. Table 2. Average review length for sanctioned and unsanctioned male physicians and female physicians in all specialties, internal medicine, and OBGYN , measured in number of words. 
Categories All specialties (n=134,973) Internal medicine (n=33,549) OBGYN (n=15,001) Female Male Female (n=9087, 27.09%), Male Female (n=7268, 48.45%), Male (n=39,142, (n=95,831, mean (SD) (n=24,462, mean (SD) (n=7733, 29%), mean 71%), mean 72.91%), mean 51.55%), (SD) (SD) (SD) mean (SD) Overall 50.1 (36.4) 45.7 (36.1) 45.8 (36.3) 41.5 (35.2) 58 (34.6) 54.5 (34.9) Sanctioned 48.7 (36.6) 47.6 (37.1) 55.1 (34) 42.9 (35.8) 46 (30.8) 51.2 (38.7) Unsanctioned 50.2 (36.4) 45.6 (36.1) 45.7 (36.3) 41.4 (35.2) 58.2 (34.7) 54.5 (34.9) High rating 39 (31) 37.2 (31.3) 35.9 (31) 33.4 (29.8) 45.4 (29.6) 47.8 (32) Low to medium rat- 61.6 (38) 56.3 (38.8) 57.3 (38.5) 53 (38.9) 69.7 (34.8) 65 (36.7) ing OBGYN: obstetrics and gynecology. The nature of physicians’ work differs between specialties, highly ranked reviews). For each analysis (eg, male physician which in turn may influence web-based reviews. Therefore, to vs female physician in highly ranked reviews), we extracted the remove the impact of specialty, our analysis in this study focused top words by similarity score to the paragraph vector of on internal medicine (the most common type of physician concatenated female reviews and concatenated male reviews, reviewed). In addition, we conducted the analysis on OBGYN respectively, in the relevant subset of data, and then compared reviews and compared the results with those for internal the differences. Our analysis focused on the words with the medicine (detailed results for the OBGYN reviews are available greatest absolute difference between the similarity scores for in Multimedia Appendix 1). This allowed us to compare results female and male reviews. Additional and complementary results across medical specialties, but the OBGYN results are are available in Multimedia Appendix 1. 
particularly interesting, as we can be confident that most reviews Review Comparison Between Genders are written by women, giving us further insight into the To examine the relative similarity scores of the words used in differences in results. Following these analyses, we compared the corpus to describe men and women, we extracted the top the length and emotion of the reviews. words by similarity score (omitting procedural-type words, eg, We examined the differences between gender and reviews in 3 “appt” and “said” for analysis purposes) for the subset of male ways: first, we analyzed male and female physicians; second, physician reviews and female physician reviews, as summarized we studied both gender and rating; and third, we analyzed the in Figure 3. This figure presents the top 15 words with the interaction of sanction and gender. For each of these analyses, largest difference between similarity scores to the document a separate doc2vec model was trained on the relevant corpus vector for concatenated female reviews and concatenated male (eg, the entire corpus, reviews of sanctioned physicians, or https://formative.jmir.org/2022/9/e34902 JMIR Form Res 2022 | vol. 6 | iss. 9 | e34902 | p. 6 (page number not for citation purposes) XSL FO RenderX JMIR FORMATIVE RESEARCH Barnett et al reviews, with the left pane showing the 15 words that scored evaluated, which supports the findings of previous analyses highest for the female reviews and the right pane showing those [14-16]. From this general comparison, without considering that scored highest for male reviews. rating, we see that women are more often evaluated with a focus on punctuality, whereas men are much more likely to be praised This led us to several interesting observations. For example, for their technical abilities and bedside manner. 
“assist,” “neg,” “difficult,” “wait,” “punctual,” and “issue” were scored as more similar to female physicians’ reviews than to male physicians’ reviews. In contrast, “superb,” “gentl,” “famil,” “skill,” “humor,” and “great” were scored as more similar to male physicians’ reviews. The key takeaway is that even before we incorporate sanction and ranking data, we see stark differences between the ways male and female physicians are reviewed.

To confirm that these underlying frequencies are statistically significant, we performed a chi-square test for the top words presented in Figure 3 (the null hypothesis was that there is no difference in these frequencies, and the alternative hypothesis was that the difference in these frequencies is not equal to 0) and found all of them to be significant (P<.001).

Figure 3. Difference in similarity scores for top words in reviews of male and female internal medicine physicians. The x-axis represents the absolute difference in similarity score for the given words to the document vector of concatenated reviews for all women and all men. The figure displays the top 15 words with the biggest differences in similarity scores for the female subset of reviews over male reviews (left pane) and the male subset of reviews over female reviews (right pane).

Review Comparison Between Gender and Rating

We used the approximate mean as the standard criterion for determining whether a rating was high (>4 stars) or low to medium (≤4 stars). Then, we created document vectors for the following subsets of concatenated reviews: (1) reviews rating female physicians highly, (2) reviews rating male physicians highly, (3) reviews rating female physicians as medium to low, and (4) reviews rating male physicians as medium to low. We repeated this analysis, focusing first on high-ranked female and male physicians and second on low to medium–ranked female and male physicians. The results are shown in Figure 4. We again compared the top words by absolute difference in similarity score between men and women within the high reviews first and then within the low to medium reviews.

For highly ranked physicians, the words that are the most associated with female physicians’ reviews over the corpora of male physicians’ reviews tend to either describe the timeliness of the visit (eg, “wait” and “rush”), liken female physicians to workers in supporting roles, or evaluate staff in those supporting roles (eg, “assist” and “staff”). In contrast, the corpora of male physicians’ reviews are more likely to contain words that are medically technical (eg, “hospit,” “cardiologist,” “skill,” or “diagnostician”) or simply glowing endorsements (eg, “brilliant,” “superb,” and “greatest”). These findings are summarized in Figure 4A.

Despite these discrepancies, we note that highly ranked physicians generally garner positive text reviews regardless of gender. Gender differences become much more pronounced when focusing on low-ranked physicians. As summarized in Figure 4B, the words with the highest similarity scores for reviews for low to medium–ranked women are objectively much more negative (eg, “unprofession,” “cold,” “issu,” “dismiss,” and “notveri”) compared with the reviews of low to medium–ranked men (eg, “skill,” “sens,” “famili,” “humor,” “great,” and “excel”). The only objectively negative word that is much more likely to occur in these male physicians’ reviews is “arrog” for arrogance (a quality more often attributed to men than to women).

Figure 4. Difference in similarity scores for top words for (A) high-ranked and (B) low to medium–ranked men and women in internal medicine. The x-axis represents the absolute difference in similarity score for the given words to the document vector of concatenated reviews for all (A) high-ranked women and all men and (B) low to medium–ranked women and all men. The figure displays the top 15 words with the biggest differences in similarity scores for the female subset of reviews over male reviews (left pane) and the male subset of reviews over female reviews (right pane).

Review Comparison Between Gender and Sanction

As discussed previously, male physicians receive high ratings on average, but at the same time are more likely to be sanctioned. This motivated our independent analysis of reviews of sanctioned and unsanctioned physicians by gender. Owing to the low overall probability of sanctions (1879/134,973, 1.39% of our sample), the reviews of unsanctioned physicians mirror the general discrepancies between men and women. In contrast, the analysis of sanctioned physicians’ reviews reveals stark gender differences, as highlighted in Figure 5. The words with the highest probability of appearing in sanctioned women’s reviews have much more negative connotations than those in sanctioned men’s reviews, whereas it is much more difficult to tell the difference between a sanctioned man and an unsanctioned man. Some of the words most highly associated with sanctioned male physicians are “specialti,” “gentl,” “helpful,” “thank,” “skill,” and “god,” whereas some of the highest scored words for sanctioned female physicians are “receptionist,” “unprofession,” “pa,” “wait,” and “notveri.” Words that are exclusive to the sanctioned male lexicon include “cardiologist,” “save,” “heart,” “hospit,” “superb,” “pleasur,” and “compassion,” which highlight the stark discrepancies even further because these words do not appear even once in a sanctioned female physician’s review (additional details are available in Figures S4 and S5 in Multimedia Appendix 1).
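The word rankings behind the similarity-score figures can be illustrated with a minimal sketch. It assumes that every word and every concatenated review corpus already has an embedding vector (in the study these come from a paragraph vector model) and that similarity is measured by cosine similarity, which is an assumption here rather than something the text states. The toy 3-dimensional vectors and the function names `cosine` and `top_words_by_difference` are hypothetical and serve only as illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_words_by_difference(word_vecs, doc_a, doc_b, k=15):
    """Rank words by sim(word, doc_a) - sim(word, doc_b), largest first."""
    diffs = {
        word: cosine(vec, doc_a) - cosine(vec, doc_b)
        for word, vec in word_vecs.items()
    }
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical toy embeddings; real document and word vectors would be
# learned from the review text and have far more dimensions.
female_doc = [1.0, 0.0, 0.0]
male_doc = [0.0, 1.0, 0.0]
word_vecs = {
    "wait": [0.9, 0.1, 0.0],
    "skill": [0.1, 0.9, 0.0],
    "staff": [0.6, 0.4, 0.0],
}

# Words closer to the female corpus than to the male corpus (left pane),
# and the reverse comparison (right pane).
female_leaning = top_words_by_difference(word_vecs, female_doc, male_doc, k=2)
male_leaning = top_words_by_difference(word_vecs, male_doc, female_doc, k=2)
```

Swapping in the sanctioned-women and sanctioned-men document vectors would give the analogous comparison for sanctioned physicians; only the two document vectors change.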
Figure 5. Difference in similarity scores for top words for sanctioned men and women in internal medicine. The x-axis represents the absolute difference in similarity score for the given words to the document vector of concatenated reviews for all sanctioned women and all sanctioned men. The figure displays the top 15 words with the biggest differences in similarity scores for the female subset of reviews over male reviews (left pane) and the male subset of reviews over female reviews (right pane).

Emotion Scoring

We analyzed the emotional scores of the reviews, generating an emotional score for each subset of physicians. We used the percentage of each top word appearing in each cut of the review corpus multiplied by the emotional score, repeated the process for each word in the lexicon to obtain a total score for each cut (eg, male, female, sanctioned women, and highly rated men), and then summed these scores within each subset. The emotion analysis is the only portion of this study in which we found noticeable differences between the 2 specialties analyzed: internal medicine and OBGYN. Therefore, we have included the results for both specialties in the main text.

In the plots below, the emotions are categorized as positive, neutral, or negative and listed alphabetically within each category in the following order: joy, positive, trust, anticipation, surprise, anger, disgust, fear, negative, and sadness.

First, we examined the differences in emotional scores between high-ranked and low to medium–ranked female physicians (Figure 6A) and between high-ranked and low to medium–ranked male physicians (Figure 6B). As expected, more positive emotions are much more likely to be found in high ratings of both men and women, with only small differences between men and women in both specialties analyzed.

When using gender (rather than rating) as the main dimension of analysis, we found that for internal medicine, reviews of men are much more emotional than those of women, for both positive and negative emotions, as demonstrated in Figure 7. A notable exception is that women’s reviews scored high on negative emotion. For OBGYN physicians (reviews that we can safely assume to be written mostly by women), the reviews are much more positive for men (overindexing on joy, positive, and trust), and the reviews of female physicians score notably high on anticipation, disgust, negativity, and sadness.

Next, we divided the analysis by gender and specialty, and then focused on the difference between sanctioned and unsanctioned physicians. The results are highlighted in Figures 8A and 8B. For female internal medicine physicians, the results are consistent with expectations; unsanctioned physicians score high on positive emotions, whereas sanctioned physicians score high on neutral and negative emotions. In contrast, for male internists, unsanctioned physicians score high across the emotional scale (however, the differences are generally small). The pattern for OBGYN physicians is very different: among female OBGYN physicians, there is great variability in the emotional scores, whereas among male OBGYN physicians, unsanctioned physicians score high on positive and neutral emotions, with very little difference in emotional scores on negative emotions.

Figure 6. Emotional score ratings for (A) female physicians’ and (B) male physicians’ reviews. The 10 emotions on the y-axis are categorized as positive, neutral, or negative (and arranged alphabetically within these categories).
The x-axis plots the difference in the emotional score between the different groups. Positive numbers mean that an emotion scored high for high-ranked physicians, and negative numbers mean the emotion scored high for low to medium–ranked physicians. OBGYN: obstetrics and gynecology.

Figure 7. Emotional scores by gender for internal medicine and obstetrics and gynecology (OBGYN).

Figure 8. Emotional scores by sanction status for (A) female physicians and (B) male physicians for both internal medicine and obstetrics and gynecology (OBGYN).

Finally, we focused on the differences between sanctioned men and sanctioned women and summarize the results in Figure S18 in Multimedia Appendix 1. In contrast to Figure 8, where we hold the gender constant and analyze across sanction statuses, in Figure S18 in Multimedia Appendix 1, we hold the sanction status constant and analyze across genders. This shows the differences in reviews for both sanctioned women and men and unsanctioned women and men. For both specialties, we found small gender differences among unsanctioned physicians. However, among sanctioned physicians, the differences are large: women, especially female internists, are disproportionately reviewed in a more negative manner. Patients in the OBGYN department tend to review sanctioned male physicians more emotionally, in both positive and negative terms (with the exception of disgust and negativity, for which sanctioned female OBGYN physicians score high on average).

Overall, we conclude that emotional scoring analysis adds a layer of depth to our understanding of the differences among lexical reviews of physicians. The differences between the specialties are even more fascinating: although we are unable to discern major differences between the specialties regarding general word composition of the lexicons, the emotional discrepancies between internal medicine reviews (written by a mix of patients) and OBGYN reviews (written by mostly female patients) are extremely clear. Holding everything else constant, internal medicine reviews of male physicians tend to be largely more emotional, regardless of whether that emotion is positive or negative. Reviews of female OBGYN physicians tend to be much more negative.

When we add sanction status to the analysis, the dynamic becomes more complex. In internal medicine, there are no notable differences in the emotional scores of sanctioned men and unsanctioned men. In contrast, reviews of sanctioned female physicians in internal medicine show negative emotion more prominently than those of unsanctioned female physicians. The biggest difference between emotional scores in this entire analysis is between unsanctioned and sanctioned male OBGYN physicians; sanctioned male OBGYN physicians receive the most negative reviews in any subset of data analyzed. When comparing sanctioned women directly with sanctioned men, sanctioned female internal medicine physicians are reviewed much more negatively than sanctioned male internal medicine physicians, but reviews of sanctioned male OBGYN physicians are more emotional overall, regardless of whether the emotion is positive or negative.

Discussion

Overview

In this study, we analyzed web-based reviews of physicians and how they differ based on physicians’ gender. We further sought to understand the complex interaction among the physician’s web-based score (rating), whether they are sanctioned by a state medical board, and gender, as revealed in the content of the web-based reviews. To investigate this interaction, we implemented paragraph vector techniques to identify words that are specific to and indicative of the separate metadata cuts. Then, we enriched these findings by using the NRC word-emotion association lexicon to assign emotional scores to 3 segments: gender, gender and sanction, and gender and rating.

Principal Findings

Our findings shed light on the different criteria by which patients evaluate male and female physicians, and they highlight the disparity in severity with which patients review male and female physicians. When we analyze the ratings of male and female physicians while holding the rating range constant, it becomes clear that women are more likely to be evaluated on their interpersonal bedside manner, whereas men are more likely to be evaluated based on their perceived technical skills and performance. This pattern holds when analyzing reviews of low-rated or medium-rated male physicians: the lexical content of their reviews is still much more likely to convey high praise, whereas women are much more likely to be severely criticized.

The dynamic is further exacerbated among men and women who are sanctioned. It is much more difficult to discern a review of a sanctioned man from the review of an unsanctioned man by the content of the written review alone, whereas for women, there is a stark contrast, and female physicians are evaluated much more harshly if they are sanctioned. The insight gained by analyzing sanctioned physicians is an important contribution of this study. There are baseline differences between how male and female physicians are perceived, but those differences are greatly magnified when the service quality is low. Sanctioned men still receive glowing reviews, whereas sanctioned women experience large reputational penalties when they deliver low-quality care or behave inappropriately.

It is essential to understand not only the quantitative differences in how and why female and male physicians are evaluated but also the qualitative aspect of those differences. Contributing to this qualitative understanding, our findings elucidate the gender-driven difference in bases for evaluations of physicians by patients. Most notably, we did not see differences in the emotional language used for sanctioned and unsanctioned male physicians, whereas female physicians who will be sanctioned have consistently more negative emotion associated with their reviews.

Comparison With Previous Studies

An expanding stream of literature shows significant gender bias in ratings, perhaps most egregiously in a case in which changing the name of an anonymous teaching assistant from male to female lowered the average review score [29]. Our study contributes to the growing literature on how web-based medical reviews are biased by gender, highlighting that in web-based reviews, women are more likely to receive negative reviews, obtain low scores, and be judged on criteria not directly related to their skills as a physician (eg, diagnostic abilities) [20,21]. We make a unique contribution by examining how physicians who are sanctioned for inappropriate behavior, negligence, or malpractice are penalized for low-quality service.

Limitations

Our results are subject to a few limitations imposed by the data. First, we only have review data and do not know the actual quality of care delivered (except care by sanctioned physicians, which we know is more likely to be poor). We do not know the types of services received, and we do not know the patient outcomes. We tried to account for these unknowns by averaging all patient reviews for each physician and comparing physicians within subspecialties, which should control for much of the variation in services provided. However, if the medical services provided within a subspecialty systematically differ between genders, there may still be some residual confounding. Second, our data set does not contain physicians’ race or ethnicity, which is another potential dimension of review bias. Future studies can investigate the possibility of racial and gender bias. Third, in our data, sanctioned physicians’ reviews before the sanction date were combined; therefore, we could not explore the commonality or information signals provided by no-text reviews or by the length of individual reviews. Fourth, owing to the small number of sanctioned physicians, we represented the presence or absence of sanctions with a binary indicator; however, sanction severity varies. Therefore, future studies can focus on sanction severity to provide a more detailed and nuanced analysis of reviews. Finally, we acknowledge that the data are a decade old at the time of publication, meaning that if there have been sociological changes in patients’ views and behavior related to physician’s gender, our results will not capture those recent developments.

Conclusions

The role and influence of web-based reviews may grow as medicine becomes increasingly computerized, a shift that has only been accelerated by the COVID-19 pandemic. As telemedicine expands in scope and prevalence, proximity becomes less of a limiting factor in selecting a physician; therefore, patients will rely more on web-based reviews to guide their physician choices. Given this growing role of reviews in physician selection, action needs to be taken to ensure that they are fair and balanced. Although awareness is the first step, websites and apps that feature or contain physician reviews should also follow best practices for mitigating gender and racial bias in those reviews. For instance, as previous studies have shown, asking specific questions rather than providing open-ended boxes for reviews can reduce bias [30]. Similarly, highlighting the potential for unconscious bias [31] and providing a rubric for evaluations [32] can also help web-based platforms to mitigate biases in physician reviews.

Authors' Contributions

The authors’ order of contribution was as follows: JB, CC, MVB, and DA. JB drafted the Methods and Results sections and conducted the final data and modeling analysis. CC conducted the initial model fitting and exploratory analysis. MVB and DA collaborated in the overall guidance and direction of the paper, acquired the data, and edited the manuscript. DA drafted the Introduction and Discussion sections. All the authors approved the final manuscript.

Conflicts of Interest

None declared.

Multimedia Appendix 1

Additional results, charts, and tables. [PDF File (Adobe PDF File), 737 KB]

References

1. Holliday AM, Kachalia A, Meyer GS, Sequist TD. Physician and patient views on public physician rating websites: a cross-sectional study. J Gen Intern Med 2017 Jun;32(6):626-631. [doi: 10.1007/s11606-017-3982-5] [Medline: 28150098]
2. Emmert M, Sander U, Pisch F. Eight questions about physician-rating websites: a systematic review. J Med Internet Res 2013 Feb 01;15(2):e24. [doi: 10.2196/jmir.2360] [Medline: 23372115]
3. Xu Y, Armony M, Ghose A. The interplay between online reviews and physician demand: an empirical investigation. Manag Sci 2021 Dec;67(12):7344-7361. [doi: 10.1287/mnsc.2020.3879]
4. Hu N, Liu L, Sambamurthy V. Fraud detection in online consumer reviews.
Decis Support Syst 2011 Feb;50(3):614-626. [doi: 10.1016/j.dss.2010.08.012]
5. Lantzy S, Anderson D. Can consumers use online reviews to avoid unsuitable doctors? Evidence from rateMDs.com and the Federation of State Medical Boards. Decis Sci 2020 Aug;51(4):962-984. [doi: 10.1111/deci.12398]
6. Lieber R. The Web Is Awash in Reviews, but Not for Doctors. Here’s Why. The New York Times. 2012 Mar 9. URL: https://www.nytimes.com/2012/03/10/your-money/why-the-web-lacks-authoritative-reviews-of-doctors.html [accessed 2021-10-01]
7. McGrath RJ, Priestley JL, Zhou Y, Culligan PJ. The validity of online patient ratings of physicians: analysis of physician peer reviews and patient ratings. Interact J Med Res 2018 Apr 09;7(1):e8. [doi: 10.2196/ijmr.9350] [Medline: 29631992]
8. Dunivin Z, Zadunayski L, Baskota U, Siek K, Mankoff J. Gender, soft skills, and patient experience in online physician reviews: a large-scale text analysis. J Med Internet Res 2020 Jul 30;22(7):e14455. [doi: 10.2196/14455] [Medline: 32729844]
9. Roter D, Lipkin Jr M, Korsgaard A. Sex differences in patients' and physicians' communication during primary care medical visits. Med Care 1991 Nov;29(11):1083-1093. [doi: 10.1097/00005650-199111000-00002] [Medline: 1943269]
10. Jefferson L, Bloor K, Birks Y, Hewitt C, Bland M. Effect of physicians' gender on communication and consultation length: a systematic review and meta-analysis. J Health Serv Res Policy 2013 Oct;18(4):242-248. [doi: 10.1177/1355819613486465] [Medline: 23897990]
11. Greenwood BN, Carnahan S, Huang L. Patient-physician gender concordance and increased mortality among female heart attack patients. Proc Natl Acad Sci U S A 2018 Aug 21;115(34):8569-8574. [doi: 10.1073/pnas.1800097115] [Medline: 30082406]
12. Hao H, Zhang K. The voice of Chinese health consumers: a text mining approach to Web-based physician reviews. J Med Internet Res 2016 May 10;18(5):e108. [doi: 10.2196/jmir.4430] [Medline: 27165558]
13. López A, Detz A, Ratanawongsa N, Sarkar U. What patients say about their doctors online: a qualitative content analysis. J Gen Intern Med 2012 Jun;27(6):685-692. [doi: 10.1007/s11606-011-1958-4] [Medline: 22215270]
14. Thawani A, Paul MJ, Sarkar U, Wallace BC. Are online reviews of physicians biased against female providers? In: Proceedings of the 4th Machine Learning for Healthcare Conference. 2019 Presented at: PMLR '19; August 8-10, 2019; Ann Arbor, MI, USA p. 406-423.
15. Marrero K, King E, Fingeret AL. Impact of surgeon gender on online physician reviews. J Surg Res 2020 Jan;245:510-515. [doi: 10.1016/j.jss.2019.07.047] [Medline: 31446193]
16. Kordzadeh N. Investigating bias in the online physician reviews published on healthcare organizations' websites. Decis Support Syst 2019 Mar;118:70-82. [doi: 10.1016/j.dss.2018.12.007]
17. Wallace BC, Paul MJ, Sarkar U, Trikalinos TA, Dredze M. A large-scale quantitative analysis of latent factors and sentiment in online doctor reviews. J Am Med Inform Assoc 2014;21(6):1098-1103. [doi: 10.1136/amiajnl-2014-002711] [Medline: 24918109]
18. Rivas R, Montazeri N, Le NX, Hristidis V. Automatic classification of online doctor reviews: evaluation of text classifier algorithms. J Med Internet Res 2018 Nov 12;20(11):e11141. [doi: 10.2196/11141] [Medline: 30425030]
19. Wartena C, Sander U, Patzelt C. Sentiment independent topic detection in rated hospital reviews. In: Proceedings of the 13th International Conference on Computational Semantics - Short Papers. 2019 Presented at: IWCS '19; May 23-27, 2019; Gothenburg, Sweden p. 59-64. [doi: 10.18653/v1/w19-0509]
20. Li S, Lee-Won RJ, McKnight J. Effects of online physician reviews and physician gender on perceptions of physician skills and Primary Care Physician (PCP) selection. Health Commun 2019 Oct;34(11):1250-1258. [doi: 10.1080/10410236.2018.1475192] [Medline: 29792519]
21. Physician Data Center Query. Federation of State Medical Boards. 2018. URL: https://www.fsmb.org/PDC/ [accessed 2022-08-26]
22. Manning CD, Schutze H. Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press; 1999.
23. Lewis DD, Yang Y, Rose TG, Li F. Rcv1: a new benchmark collection for text categorization research. J Mach Learn Res 2004;5(Apr):361-397.
24. Rijsbergen CJ, Robertson SE, Porter MF. New Models in Probabilistic Information Retrieval. Vol. 5587. London, UK: British Library Research and Development Department; 1980.
25. Porter MF. An algorithm for suffix stripping. Program 1980;14(3):130-137. [doi: 10.1108/eb046814]
26. Le Q, Mikolov T. Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning. 2014 Presented at: PMLR '14; June 21-26, 2014; Beijing, China p. 1188-1196.
27. Mohammad SM, Turney PD. Crowdsourcing a word-emotion association lexicon. Comput Intell 2013 Aug;29(3):436-465. [doi: 10.1111/j.1467-8640.2012.00460.x]
28. Lau JH, Baldwin T. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv. Preprint posted online July 19, 2016. [doi: 10.48550/ARXIV.1607.05368]
29. MacNell L, Driscoll A, Hunt AN. What’s in a name: exposing gender bias in student ratings of teaching. Innov High Educ 2014 Dec 5;40(4):291-303. [doi: 10.1007/s10755-014-9313-4]
30. Castilla EJ. Gender, race, and meritocracy in organizational careers. AJS 2008 May;113(6):1479-1526. [doi: 10.1086/588738] [Medline: 19044141]
31. Peterson DA, Biederman LA, Andersen D, Ditonto TM, Roe K. Mitigating gender bias in student evaluations of teaching. PLoS One 2019 May 15;14(5):e0216241. [doi: 10.1371/journal.pone.0216241] [Medline: 31091292]
32. Uhlmann EL, Cohen GL. Constructed criteria: redefining merit to justify discrimination. Psychol Sci 2005 Jun;16(6):474-480. [doi: 10.1111/j.0956-7976.2005.01559.x] [Medline: 15943674]

Abbreviations

LDA: latent Dirichlet allocation
NRC: National Research Council Canada
OBGYN: obstetrics and gynecology

Edited by A Mavragani; submitted 17.11.21; peer-reviewed by R McGrath, J Fan, H Stevens; comments to author 24.01.22; revised version received 26.04.22; accepted 17.07.22; published 08.09.22

Please cite as:
Barnett J, Bjarnadóttir MV, Anderson D, Chen C
Understanding Gender Biases and Differences in Web-Based Reviews of Sanctioned Physicians Through a Machine Learning Approach: Mixed Methods Study
JMIR Form Res 2022;6(9):e34902
URL: https://formative.jmir.org/2022/9/e34902
doi: 10.2196/34902

©Julia Barnett, Margrét Vilborg Bjarnadóttir, David Anderson, Chong Chen. Originally published in JMIR Formative Research (https://formative.jmir.org), 08.09.2022. This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.

JMIR Formative Research
JMIR Publications

Understanding Gender Biases and Differences in Web-Based Reviews of Sanctioned Physicians Through a Machine Learning Approach: Mixed Methods Study



Publisher
JMIR Publications
Copyright
Copyright © The Author(s). Licensed under Creative Commons Attribution cc-by 4.0
ISSN
2561-326X
DOI
10.2196/34902

However, this kind of gender effect is not observed among physicians endorsed for Overview their technical skills. Bedside manner, diagnosis accuracy, The field of text mining and natural language processing is patients’ waiting time, and consultation length are critical in growing rapidly, with many emerging techniques available to patients’ choice of a physician [3]. analyze text and discover patterns in documents via automated Our study is the first to analyze the content of patient reviews procedures. In their book, Foundations of Statistical Natural of physicians across genders using natural language processing Language Processing, Manning and Schutze [22] stated that tools that accounts for differences in ratings and sanction status. the availability of large text corpora has changed the scientific This allowed us to understand both the set of criteria on which approach to language in linguistics and cognitive science. male and female physicians are evaluated and the impact of Therefore, phenomena that were previously undetectable or poor performance (as measured by sanctions). We further seemingly uninteresting have become the central focus of lexical applied an emotional index to understand, in a multidimensional analysis. Taking advantage of some of these new developments, way, the tones of the different types of reviews based on ratings, in this study, we implemented paragraph vector (as described gender, and sanction status. More specifically, in this study, we in the following sections) and used a word-emotion association aimed to determine whether reviews of women systematically lexicon on the corpus of physician reviews to analyze the data differ from those of men. In particular, we aimed to discover in a nuanced manner. 
whether female physicians are rated lower at baseline than male Data Preprocessing physicians and whether female physicians experience larger To make the raw data analyzable, we performed a series of reputational penalties than male physicians for low-quality tasks. First, the reviews were converted to lowercase, so that services (as indicated by sanctions from the state medical board). capital letters are treated the same as lowercase letters. Second, punctuation was removed because it typically adds unnecessary Methods noise to word models. Third, stopwords, defined as unimportant Data words that are overly common (eg, “the,” “and,” and “is”) were removed using a freely available System for the Mechanical Our data were collected from 2 sources: physician reviews were Analysis and Retrieval of Text stopword list built by Salton and obtained from RateMDs and combined with physician sanction Buckley and sourced from web-based Appendix 11 of the paper data from the Federation of State Medical Boards [21]. by Lewis et al [23]. Fourth, we removed numbers because, The data from RateMDs include physicians’ average ratings on similar to punctuation, they add noise to the analysis. On the a 1- to 5-star scale. Reviewers rate the overall experience and remaining words in the corpus, we performed stemming using 4 other defined categories: helpfulness, knowledgeability, the Porter stemming algorithm [24,25]. Stemming is the act of punctuality, and staff. The data further contain the text of the reducing words to their root form (eg, “practice,” “practicing,” reviews. and “practiced” become “practic”). This allows models to treat these words as one concept rather than as separate ideas. As we State licensing boards issue sanctions to physicians for issues had a limited-sized data set, we applied all the preprocessing related to their suitability to practice medicine in each state. 
steps mentioned previously to maximize insights from a concise Reasons for sanctions include, but are not limited to, serious vocabulary. Although the removal of stopwords resulted in malpractice, performing unnecessary treatment, fraudulent some locally unnatural word sequences (such as articles not billing, and abuse of patients. We collected every review posted appearing before nouns), we found that this did not hinder our between October 2004 and August 2011 and matched it by analysis. name, location, state, and specialty with the database of https://formative.jmir.org/2022/9/e34902 JMIR Form Res 2022 | vol. 6 | iss. 9 | e34902 | p. 3 (page number not for citation purposes) XSL FO RenderX JMIR FORMATIVE RESEARCH Barnett et al differences between the subsets. For example, we computed the Analytical Techniques similarity of wait to the female corpus of reviews, computed In this study, we applied a paragraph vector framework [26], a the similarity of wait to the male corpus of reviews, and natural language processing method that represents each word calculated the difference in these scores. Our analysis focused or document as a dense vector (ie, a location in space), on the words with the highest absolute difference between the called an embedding, which is then used as an input to train a similarity scores for one subset of reviews (typically female) model to predict co-occurrence of words. We used the paragraph and the complementary subset (typically male). vector distributed bag-of-words model, which uses words from To understand the emotional nature of the reviews, we used the a given width window to predict the next word in the document. 
NRC word-emotion association lexicon [27] to attribute In this framework, “kind” is located closer to “nice” than sentiment and emotional scores to the corpus (NRC stands for “surgery” because “nice” has a much higher probability than the National Research Council Canada, but the lexicon is “surgery” of being found in similar contexts as “kind.” We used commonly referred to as the NRC emotion lexicon). This lexicon the paragraph vector model to generate an embedding of words, created an afinn dictionary by rating words on a scale of 8 which can be used to calculate the similarity (via cosine emotions: anger, anticipation, disgust, fear, joy, sadness, similarity) between any set of words or documents. Henceforth, surprise, and trust. Using the scores from this lexicon, we were we refer to the cosine similarity between words or documents able to both rate reviews on an aggregate emotional scale (how as the similarity score. emotional the document is as a whole) and rank them for each For each data slice (eg, sanctioned physicians), we trained a of the 8 emotions. More specifically, for each data cut (eg, paragraph vector model. Once the model was trained, we could sanctioned female physicians), each word in each physician’s use the embedding to identify words associated with the medical review was scored based on the emotional score of the word, reviews of different types of physicians (eg, based on gender). and then, average physician score was derived by averaging all To compare specific differences across a physician population, physicians’ emotional scores. Understanding these emotional we concatenated every review from one specific subset of data scores allowed us to develop a deep understanding of the criteria (eg, sanctioned male physicians) and found the similarity scores that patients use to evaluate female and male physicians and of this document with each word within the corpus. Then, we how those criteria differ. 
repeated this process for the complementary subset (eg, We have summarized the methodological approach in Figure sanctioned female physicians) and compared the similarity scores for each subset. We extracted the words with the largest Figure 1. Analysis flowchart. NRC: National Research Council Canada; OBGYN: obstetrics and gynecology. https://formative.jmir.org/2022/9/e34902 JMIR Form Res 2022 | vol. 6 | iss. 9 | e34902 | p. 4 (page number not for citation purposes) XSL FO RenderX JMIR FORMATIVE RESEARCH Barnett et al have listed the exact parameter settings in Multimedia Appendix Implementation and Hypertuning 1 and summarized the results of the paragraph vector model for We used doc2vec [28], a Python package implementing “knowledgeable,” “wonderful,” “caring,” and “rude” in Figure paragraph vectors, to learn how patients review physicians S2 in Multimedia Appendix 1. differently across gender, sanctions, and ratings (both in isolation and interaction). We performed a standard doc2vec Ethical Considerations implementation to learn the paragraph vectors of the following: All the data used in this study are publicly available and do not Gender (male and female) contain identifiable private information about individuals. Thus, Composite label of gender and sanction (female sanctioned, this study was not deemed to require institutional review board female unsanctioned, male sanctioned, and male review. After merging sanction data with review data by name, unsanctioned) specific physician identities were removed from the data set Composite label of gender and rating (female high rating, and not used in the analyses. female low to medium rating, male high rating, and male low to medium rating), where we defined high rating as ≥4 Results stars and low to medium rating as <4 stars Data Overview To overcome majority bias, we sampled an equal number of Figure 2 shows the number of physicians in each specialty by reviews for each group. 
We trained the models independently gender. Internal medicine and family practice are the 2 most for the different metadata cuts, rather than treating each separate common specialties in our data. The figure highlights that there review as an individual document. By fitting the different groups are more male physicians than female physicians in every separately, we were able to understand the specific lexicons specialty; overall, 29% (39,142/134,973) of the physicians in associated with each metadata cut (gender, sanction, and rating). the sample are women. This gender imbalance is easily Then, we analyzed the similarity scores of words to their noticeable in the more common disciplines; internal medicine respective corpora and compared the scores. has the highest number of female physicians, but there are still We pretrained the paragraph vector framework, using the twice as many male physicians. The imbalance is even more continuous bag-of-words algorithm to tune the hyperparameters, prominent in some small disciplines such as orthopedic surgery by testing the most similar words to several words such as and neurological surgery, where men outnumber women 23:1 “knowledgeable,” “wonderful,” “caring,” and “rude.” We ran and 15:1, respectively. Obstetrics and gynecology (OBGYN) multiple variations of the model to identify the best settings. and pediatrics departments are more balanced in terms of gender, The results were consistent across different parameters, which with an approximately even ratio of men to women. gave us confidence in the robustness of the final model. We Figure 2. Number of physicians in each specialty, broken down by gender. Table 1 highlights the average star ratings overall and for each When the ratings of sanctioned male physicians are compared of the main categories present in the reviews (helpfulness, with those of sanctioned female physicians, the absolute knowledgeability, punctuality, and staff). 
This pattern is differences are of similar magnitude; however, owing to the consistent across specialties including internal medicine and small size of the sanctioned population, the differences are not OBGYN. Furthermore, unsanctioned physicians receive higher statistically significant. Among sanctioned physicians, female ratings than sanctioned physicians. We note that the staff physicians receive lower ratings (by an average of approximately category ratings may not be reflective of the physician’s medical 0.1 stars) than male physicians (not considering specialties). capabilities. In all cases, the average rating for men is higher The difference between genders among sanctioned physicians than that for women. These differences are statistically is greater than that among unsanctioned physicians, especially significant (P<.001, evaluated with 2-tailed t tests) both when for those rated around average for helpfulness and comparing unsanctioned male physicians with unsanctioned knowledgeability. A detailed breakdown of the number of female physicians and when comparing unsanctioned female sanctioned physicians is provided in Table S1 in Multimedia or male physicians with sanctioned female or male physicians. Appendix 1. https://formative.jmir.org/2022/9/e34902 JMIR Form Res 2022 | vol. 6 | iss. 9 | e34902 | p. 5 (page number not for citation purposes) XSL FO RenderX JMIR FORMATIVE RESEARCH Barnett et al Table 1. Average star rating (out of 5 stars) overall and for the 4 RateMDs score categories for the whole sample of physicians. Ratings are separated by gender and sanction status. 
Female physicians: n=39,142 (29%); male physicians: n=95,831 (71%). All values are mean (SD).

| Categories | Full sample (n=134,973) | Female, unsanctioned (n=38,892, 99.36%) | Female, sanctioned (n=250, 0.64%) | Male, unsanctioned (n=94,202, 98.30%) | Male, sanctioned (n=1629, 1.69%) |
| --- | --- | --- | --- | --- | --- |
| Overall | 3.86 (1.12) | 3.81 (1.12) | 3.41 (1.21) | 3.89 (1.12) | 3.52 (1.24) |
| Helpfulness | 3.89 (1.36) | 3.85 (1.35) | 3.47 (1.44) | 3.90 (1.36) | 3.59 (1.48) |
| Knowledgeability | 4.03 (1.25) | 3.99 (1.24) | 3.62 (1.35) | 4.06 (1.25) | 3.75 (1.38) |
| Punctuality | 3.83 (1.18) | 3.77 (1.18) | 3.39 (1.27) | 3.86 (1.18) | 3.43 (1.31) |
| Staff | 3.67 (1.3) | 3.61 (1.3) | 3.02 (1.41) | 3.70 (1.3) | 3.19 (1.45) |

On average, each physician receives 3 reviews, with an average length of 55.7 (SD 47.65) words. In general, lower-ranked physicians receive longer reviews than higher-ranked physicians; people have more to say about an experience they are dissatisfied with. On average, women receive longer reviews than men. An exception is in the OBGYN field; in this specialty, patients have more to say about a sanctioned male physician than they do about a sanctioned female physician.

However, on the whole and across specialties, the group with the longest reviews are the low to medium–ranked women. Given what we know from our subsequent content analysis, this is because patients have longer and more negative comments to make about women, whereas reviews of male physicians are short and more positive. As highlighted in Table 2, these trends hold when we break down the length analysis by specialty.

Table 2. Average review length for sanctioned and unsanctioned male physicians and female physicians in all specialties, internal medicine, and OBGYN, measured in number of words.
All specialties: n=134,973; internal medicine: n=33,549; OBGYN: n=15,001. All values are mean (SD).

| Categories | All specialties, female (n=39,142, 29%) | All specialties, male (n=95,831, 71%) | Internal medicine, female (n=9087, 27.09%) | Internal medicine, male (n=24,462, 72.91%) | OBGYN, female (n=7268, 48.45%) | OBGYN, male (n=7733, 51.55%) |
| --- | --- | --- | --- | --- | --- | --- |
| Overall | 50.1 (36.4) | 45.7 (36.1) | 45.8 (36.3) | 41.5 (35.2) | 58 (34.6) | 54.5 (34.9) |
| Sanctioned | 48.7 (36.6) | 47.6 (37.1) | 55.1 (34) | 42.9 (35.8) | 46 (30.8) | 51.2 (38.7) |
| Unsanctioned | 50.2 (36.4) | 45.6 (36.1) | 45.7 (36.3) | 41.4 (35.2) | 58.2 (34.7) | 54.5 (34.9) |
| High rating | 39 (31) | 37.2 (31.3) | 35.9 (31) | 33.4 (29.8) | 45.4 (29.6) | 47.8 (32) |
| Low to medium rating | 61.6 (38) | 56.3 (38.8) | 57.3 (38.5) | 53 (38.9) | 69.7 (34.8) | 65 (36.7) |

OBGYN: obstetrics and gynecology.

The nature of physicians’ work differs between specialties, which in turn may influence web-based reviews. Therefore, to remove the impact of specialty, our analysis in this study focused on internal medicine (the most common type of physician reviewed). In addition, we conducted the analysis on OBGYN reviews and compared the results with those for internal medicine (detailed results for the OBGYN reviews are available in Multimedia Appendix 1). This allowed us to compare results across medical specialties, but the OBGYN results are particularly interesting, as we can be confident that most reviews are written by women, giving us further insight into the differences in results. Following these analyses, we compared the length and emotion of the reviews.

We examined the differences between gender and reviews in 3 ways: first, we analyzed male and female physicians; second, we studied both gender and rating; and third, we analyzed the interaction of sanction and gender. For each of these analyses, a separate doc2vec model was trained on the relevant corpus (eg, the entire corpus, reviews of sanctioned physicians, or highly ranked reviews). For each analysis (eg, male physician vs female physician in highly ranked reviews), we extracted the top words by similarity score to the paragraph vector of concatenated female reviews and concatenated male reviews, respectively, in the relevant subset of data, and then compared the differences. Our analysis focused on the words with the greatest absolute difference between the similarity scores for female and male reviews. Additional and complementary results are available in Multimedia Appendix 1.

Review Comparison Between Genders

To examine the relative similarity scores of the words used in the corpus to describe men and women, we extracted the top words by similarity score (omitting procedural-type words, eg, “appt” and “said” for analysis purposes) for the subset of male physician reviews and female physician reviews, as summarized in Figure 3. This figure presents the top 15 words with the largest difference between similarity scores to the document vector for concatenated female reviews and concatenated male reviews, with the left pane showing the 15 words that scored highest for the female reviews and the right pane showing those that scored highest for male reviews.

This led us to several interesting observations. For example, “assist,” “neg,” “difficult,” “wait,” “punctual,” and “issue” were scored as more similar to female physicians’ reviews than male physicians’ reviews. In contrast, “superb,” “gentl,” “famil,” “skill,” “humor,” and “great” were scored as more similar to male physicians’ reviews. The key takeaway is that even before we incorporate sanction and ranking data, we see stark differences between the ways male and female physicians are evaluated, which supports the findings of previous analyses [14-16]. From this general comparison, without considering rating, we see that women are more often evaluated with a focus on punctuality, whereas men are much more likely to be praised for their technical abilities and bedside manner. To confirm that these underlying frequencies are statistically significant, we performed a chi-square test for the top words presented in Figure 3 (the null hypothesis was that there is no difference in these frequencies, and the alternative hypothesis was that there are differences in these frequencies not equal to 0) and found all of them to be significant (P<.001).

Figure 3. Difference in similarity scores for top words in reviews of male and female internal medicine physicians. The x-axis represents the absolute difference in similarity score for the given words to the document vector of concatenated reviews for all women and all men. The figure displays the top 15 words; the biggest differences in similarity scores are for the female subset of reviews over male reviews (left pane) and the male subset of reviews over female reviews (right pane).

Review Comparison Between Gender and Rating

We used the approximate mean as the standard criterion for determining whether a rating was high (>4 stars) or low to medium (≤4 stars). Then, we created document vectors for the following subsets of concatenated reviews: (1) reviews rating female physicians highly, (2) reviews rating male physicians highly, (3) reviews rating female physicians as medium to low, and (4) reviews rating male physicians as medium to low. We repeated this analysis, focusing first on high-ranked female and male physicians and second on low to medium–ranked female and male physicians. The results are shown in Figure 4. We again compared the top words by absolute difference in similarity score between men and women within the high reviews first and then within the low to medium reviews.

For highly ranked physicians, the words that are the most associated with female physicians’ reviews over the corpora of male physicians’ reviews tend to either describe the timeliness of the visit (eg, “wait” and “rush”), liken female physicians to workers in supporting roles, or evaluate staff in those supporting roles (eg, “assist” and “staff”). In contrast, the corpora of male physicians’ reviews are more likely to contain words that are medically technical (eg, “hospit,” “cardiologist,” “skill,” or “diagnostician”) or simply glowing endorsements (eg, “brilliant,” “superb,” and “greatest”). These findings are summarized in Figure 4A.

Despite these discrepancies, we note that highly ranked physicians generally garner positive text reviews regardless of gender. Gender differences become much more pronounced when focusing on low-ranked physicians. As summarized in Figure 4B, the words with the highest similarity scores for reviews for low to medium–ranked women are objectively much more negative (eg, “unprofession,” “cold,” “issu,” “dismiss,” and “notveri”) compared with the reviews of low to medium–ranked men (eg, “skill,” “sens,” “famili,” “humor,” “great,” and “excel”). The only objectively negative word that is much more likely to occur in these male physicians’ reviews is “arrog” for arrogance (a quality more often attributed to men than to women).

Figure 4. Difference in similarity scores for top words for (A) high-ranked and (B) low to medium–ranked men and women in internal medicine.
The x-axis represents the absolute difference in similarity score for the given words to the document vector of concatenated reviews for all (A) high-ranked women and all men and (B) low to medium–ranked women and all men. The figure displays the top 15 words with the biggest differences in similarity scores for the female subset of reviews over male reviews (left pane) and the male subset of reviews over female reviews (right pane).

Review Comparison Between Gender and Sanction

As discussed previously, male physicians receive high ratings on average, but at the same time are more likely to be sanctioned. This motivated our independent analysis of reviews of sanctioned and unsanctioned physicians by gender. Owing to the low overall probability of sanctions (1879/134,973, 1.39% of our sample), the reviews of unsanctioned physicians mirror the general discrepancies between men and women. In contrast, the analysis of sanctioned physicians’ reviews reveals stark gender differences, as highlighted in Figure 5. The words with the highest probability of appearing in sanctioned women’s reviews have much more negative connotations than those in sanctioned men’s reviews, whereas it is much more difficult to tell the difference between a sanctioned man and an unsanctioned man. Some of the words most highly associated with sanctioned male physicians are “specialti,” “gentl,” “helpful,” “thank,” “skill,” and “god,” whereas some of the highest scored words for sanctioned female physicians are “receptionist,” “unprofession,” “pa,” “wait,” and “notveri.” Words that are exclusive to the sanctioned male lexicon include “cardiologist,” “save,” “heart,” “hospit,” “superb,” “pleasur,” and “compassion,” which highlight the stark discrepancies even further because these words do not appear even once in a sanctioned female physician’s review (additional details are available in Figures S4 and S5 in Multimedia Appendix 1).

Figure 5. Difference in similarity scores for top words for sanctioned men and women in internal medicine. The x-axis represents the absolute difference in similarity score for the given words to the document vector of concatenated reviews for all sanctioned women and all sanctioned men. The figure displays the top 15 words with the biggest differences in similarity scores for the female subset of reviews over male reviews (left pane) and the male subset of reviews over female reviews (right pane).

Emotion Scoring

We analyzed the emotional scores of the reviews, generating an emotional score for each subset of physicians. We used the percentage of each top word appearing in each cut of the review corpus multiplied by the emotional score, repeated the process for each word in the lexicon to obtain a total score for each cut (eg, male, female, sanctioned women, and highly rated men), and then summed these scores within each subset. The emotion analysis is the only portion of this study in which we found noticeable differences between the 2 specialties analyzed: internal medicine and OBGYN. Therefore, we have included the results for both specialties in the main text.

In the plots below, the emotions are categorized as positive, negative, or neutral and listed alphabetically within each category in the following order: joy, positive, trust, anticipation, surprise, anger, disgust, fear, negative, and sadness.

First, we examined the differences in emotional scores between high-ranked and low to medium–ranked female physicians (Figure 6A) and between high-ranked and low to medium–ranked male physicians (Figure 6B). As expected, more positive emotions are much more likely to be found in high ratings of both men and women, with only small differences between men and women in both specialties analyzed.

When using gender (rather than rating) as the main dimension of analysis, we found that for internal medicine, reviews of men are much more emotional than those of women, for both positive and negative emotions, as demonstrated in Figure 7. A notable exception is that women’s reviews scored high on negative emotion. For OBGYN physicians (reviews that we can safely assume to be written mostly by women), the reviews are much more positive for men (overindexing on joy, positive, and trust), and the reviews of female physicians score notably high on anticipation, disgust, negativity, and sadness.

Next, we divided the analysis by gender and specialty, and then focused on the difference between sanctioned and unsanctioned physicians. The results are highlighted in Figures 8A and 8B. For female internal medicine physicians, the results are consistent with expectations; unsanctioned physicians score high on positive emotions, whereas sanctioned physicians score high on neutral and negative emotions. In contrast, for male internists, unsanctioned physicians score high across the emotional scale (however, the differences are generally small). The pattern for OBGYN physicians is very different: among female OBGYN physicians, there is great variability in the emotional scores, whereas among male OBGYN physicians, unsanctioned physicians score high on positive and neutral emotions, with very little difference in emotional scores on negative emotions.

Figure 6. Emotional score ratings for (A) female physicians’ and (B) male physicians’ reviews. The 10 emotions on the y-axis are categorized as positive, neutral, or negative (and arranged alphabetically within these categories).
The x-axis plots the difference in the emotional score between the different groups. Positive numbers mean that an emotion scored high for high-ranked physicians, and negative numbers mean the emotion scored high for low to medium–ranked physicians. OBGYN: obstetrics and gynecology.

Figure 7. Emotional scores by gender for internal medicine and obstetrics and gynecology (OBGYN).

Figure 8. Emotional scores by sanction status for (A) female physicians and (B) male physicians for both internal medicine and obstetrics and gynecology (OBGYN).

Finally, we focused on the differences between sanctioned men and sanctioned women and summarize the results in Figure S18 in Multimedia Appendix 1. In contrast to Figure 8, where we hold the gender constant and analyze across sanction statuses, in Figure S18 in Multimedia Appendix 1, we hold the sanction status constant and analyze across genders. This shows the differences in reviews for both sanctioned women and men and unsanctioned women and men. For both specialties, we found small gender differences among unsanctioned physicians. However, among sanctioned physicians, the differences are large: women, especially female internists, are disproportionately reviewed in a more negative manner. Patients in the OBGYN department tend to review sanctioned male physicians more emotionally, in both positive and negative terms (with the exception of disgust and negativity, for which sanctioned female OBGYN physicians score high on average).

Overall, we conclude that emotional scoring analysis adds a layer of depth to our understanding of the differences among lexical reviews of physicians. The differences between the specialties are even more fascinating: although we are unable to discern major differences between the specialties regarding general word composition of the lexicons, the emotional discrepancies between internal medicine reviews (written by a mix of patients) and OBGYN reviews (written by mostly female patients) are extremely clear. Holding everything else constant, internal medicine reviews of male physicians tend to be largely more emotional, regardless of whether that emotion is positive or negative. Reviews of female OBGYN physicians tend to be much more negative.

When we add sanction status to the analysis, the dynamic becomes more complex. In internal medicine, there are no notable differences in the emotional scores of sanctioned men and unsanctioned men. In contrast, reviews of sanctioned female physicians in internal medicine show negative emotion more prominently than those of unsanctioned female physicians. The biggest difference between emotional scores in this entire analysis is between unsanctioned and sanctioned male OBGYN physicians; sanctioned male OBGYN physicians receive the most negative reviews in any subset of data analyzed. When comparing sanctioned women directly with sanctioned men, sanctioned female internal medicine physicians are reviewed much more negatively than sanctioned male internal medicine physicians, but reviews of sanctioned male OBGYN physicians are more emotional overall, regardless of whether the emotion is positive or negative.

Discussion

Overview

In this study, we analyzed web-based reviews of physicians and how they differ based on physicians’ gender. We further sought to understand the complex interaction among the physician’s web-based score (rating), whether they are sanctioned by a state medical board, and gender, as revealed in the content of the web-based reviews. To investigate this interaction, we implemented paragraph vector techniques to identify words that are specific to and indicative of the separate metadata cuts. Then, we enriched these findings by using the NRC word-emotion association lexicon to assign emotional scores to 3 segments: gender, gender and sanction, and gender and rating.

Principal Findings

Our findings shed light on the different criteria by which patients evaluate male and female physicians, and they highlight the disparity in severity with which patients review male and female physicians. When we analyze the ratings of male and female physicians while holding the rating range constant, it becomes clear that women are more likely to be evaluated on their interpersonal bedside manner, whereas men are more likely to be evaluated based on their perceived technical skills and performance. This pattern holds when analyzing reviews of low-rated or medium-rated male physicians: the lexical content of their reviews is still much more likely to convey high praise, whereas women are much more likely to be severely criticized.

The dynamic is further exacerbated among men and women who are sanctioned. It is much more difficult to discern a review of a sanctioned man from the review of an unsanctioned man by the content of the written review alone, whereas for women, there is a stark contrast, and female physicians are evaluated much more harshly if they are sanctioned. The insight gained by analyzing sanctioned physicians is an important contribution of this study. There are baseline differences between how male and female physicians are perceived, but those differences are greatly magnified when the service quality is low. Sanctioned men still receive glowing reviews, whereas sanctioned women experience large reputational penalties when they deliver low-quality care or behave inappropriately.

It is essential to understand not only the quantitative differences in how and why female and male physicians are evaluated but also the qualitative aspect of those differences. Contributing to this qualitative understanding, our findings elucidate the gender-driven difference in bases for evaluations of physicians by patients. Most notably, we did not see differences in the emotional language used for sanctioned and unsanctioned male physicians, whereas female physicians who will be sanctioned have consistently more negative emotion associated with their reviews.

Comparison With Previous Studies

An expanding stream of literature shows significant gender bias in ratings, perhaps most egregiously in a case in which changing the name of an anonymous teaching assistant from male to female lowered the average review score [29]. Our study contributes to the growing literature on how web-based medical reviews are biased by gender, highlighting that in web-based reviews, women are more likely to receive negative reviews, obtain low scores, and be judged on criteria not directly related to their skills as a physician (eg, diagnostic abilities) [20,21]. We make a unique contribution by examining how physicians who are sanctioned for inappropriate behavior, negligence, or malpractice are penalized for low-quality service.

Limitations

Our results are subject to a few limitations imposed by the data. First, we only have review data and do not know the actual quality of care delivered (except care by sanctioned physicians, which we know is more likely to be poor). We do not know the types of services received, and we do not know the patient outcomes. We tried to account for these unknowns by averaging all patient reviews for each physician and comparing physicians within subspecialties, which should control for much of the variation in services provided. However, if the medical services provided within a subspecialty systematically differ between genders, there may still be some residual confounding. Second, our data set does not contain physicians’ race or ethnicity, which is another potential dimension of review bias. Future studies can investigate the possibility of racial and gender bias. Third, in our data, sanctioned physicians’ reviews before the sanction date were combined; therefore, we could not explore the commonality or information signals provided by no-text reviews or by the length of individual reviews. Fourth, owing to the small number of sanctioned physicians, we represented the presence or absence of sanctions with a binary indicator; however, sanction severity varies. Therefore, future studies can focus on sanction severity to provide a more detailed and nuanced analysis of reviews. Finally, we acknowledge that the data are a decade old at the time of publication, meaning that if there have been sociological changes in patients’ views and behavior related to physician’s gender, our results will not capture those recent developments.

Conclusions

The role and influence of web-based reviews may grow as medicine becomes increasingly computerized, a shift that has only been accelerated by the COVID-19 pandemic. As telemedicine expands in scope and prevalence, proximity becomes less of a limiting factor in selecting a physician; therefore, patients will rely more on web-based reviews to guide their physician choices. Given this growing role of reviews in physician selection, action needs to be taken to ensure that they are fair and balanced. Although awareness is the first step, websites and apps that feature or contain physician reviews should also follow best practices for mitigating gender and racial bias in those reviews. For instance, as previous studies have shown, asking specific questions rather than providing open-ended boxes for reviews can reduce bias [30]. Similarly, highlighting the potential for unconscious bias [31] and providing a rubric for evaluations [32] can also help web-based platforms to mitigate biases in physician reviews.

Authors' Contributions

The authors’ order of contribution was as follows: JB, CC, MVB, and DA. JB drafted the Methods and Results sections and conducted the final data and modeling analysis. CC conducted the initial model fitting and exploratory analysis. MVB and DA collaborated in the overall guidance and direction of the paper, acquired the data, and edited the manuscript. DA drafted the Introduction and Discussion sections. All the authors approved the final manuscript.

Conflicts of Interest

None declared.

Multimedia Appendix 1

Additional results, charts, and tables. [PDF File (Adobe PDF File), 737 KB-Multimedia Appendix 1]

References

1. Holliday AM, Kachalia A, Meyer GS, Sequist TD. Physician and patient views on public physician rating websites: a cross-sectional study. J Gen Intern Med 2017 Jun;32(6):626-631. [doi: 10.1007/s11606-017-3982-5] [Medline: 28150098]
2. Emmert M, Sander U, Pisch F. Eight questions about physician-rating websites: a systematic review. J Med Internet Res 2013 Feb 01;15(2):e24. [doi: 10.2196/jmir.2360] [Medline: 23372115]
3. Xu Y, Armony M, Ghose A. The interplay between online reviews and physician demand: an empirical investigation. Manag Sci 2021 Dec;67(12):7344-7361. [doi: 10.1287/mnsc.2020.3879]
4. Hu N, Liu L, Sambamurthy V. Fraud detection in online consumer reviews.
Decis Support Syst 2011 Feb;50(3):614-626. [doi: 10.1016/j.dss.2010.08.012]
5. Lantzy S, Anderson D. Can consumers use online reviews to avoid unsuitable doctors? Evidence from rateMDs.com and the Federation of State Medical Boards. Decis Sci 2020 Aug;51(4):962-984. [doi: 10.1111/deci.12398]
6. Lieber R. The Web Is Awash in Reviews, but Not for Doctors. Here’s Why. The New York Times. 2012 Mar 9. URL: https://www.nytimes.com/2012/03/10/your-money/why-the-web-lacks-authoritative-reviews-of-doctors.html [accessed 2021-10-01]
7. McGrath RJ, Priestley JL, Zhou Y, Culligan PJ. The validity of online patient ratings of physicians: analysis of physician peer reviews and patient ratings. Interact J Med Res 2018 Apr 09;7(1):e8. [doi: 10.2196/ijmr.9350] [Medline: 29631992]
8. Dunivin Z, Zadunayski L, Baskota U, Siek K, Mankoff J. Gender, soft skills, and patient experience in online physician reviews: a large-scale text analysis. J Med Internet Res 2020 Jul 30;22(7):e14455. [doi: 10.2196/14455] [Medline: 32729844]
9. Roter D, Lipkin Jr M, Korsgaard A. Sex differences in patients' and physicians' communication during primary care medical visits. Med Care 1991 Nov;29(11):1083-1093. [doi: 10.1097/00005650-199111000-00002] [Medline: 1943269]
10. Jefferson L, Bloor K, Birks Y, Hewitt C, Bland M. Effect of physicians' gender on communication and consultation length: a systematic review and meta-analysis. J Health Serv Res Policy 2013 Oct;18(4):242-248. [doi: 10.1177/1355819613486465] [Medline: 23897990]
11. Greenwood BN, Carnahan S, Huang L. Patient-physician gender concordance and increased mortality among female heart attack patients. Proc Natl Acad Sci U S A 2018 Aug 21;115(34):8569-8574. [doi: 10.1073/pnas.1800097115] [Medline: 30082406]
12. Hao H, Zhang K. The voice of Chinese health consumers: a text mining approach to Web-based physician reviews. J Med Internet Res 2016 May 10;18(5):e108. [doi: 10.2196/jmir.4430] [Medline: 27165558]
13. López A, Detz A, Ratanawongsa N, Sarkar U. What patients say about their doctors online: a qualitative content analysis. J Gen Intern Med 2012 Jun;27(6):685-692. [doi: 10.1007/s11606-011-1958-4] [Medline: 22215270]
14. Thawani A, Paul MJ, Sarkar U, Wallace BC. Are online reviews of physicians biased against female providers? In: Proceedings of the 4th Machine Learning for Healthcare Conference. 2019 Presented at: PMLR '19; August 8-10, 2019; Ann Arbor, MI, USA p. 406-423.
15. Marrero K, King E, Fingeret AL. Impact of surgeon gender on online physician reviews. J Surg Res 2020 Jan;245:510-515. [doi: 10.1016/j.jss.2019.07.047] [Medline: 31446193]
16. Kordzadeh N. Investigating bias in the online physician reviews published on healthcare organizations' websites. Decis Support Syst 2019 Mar;118:70-82. [doi: 10.1016/j.dss.2018.12.007]
17. Wallace BC, Paul MJ, Sarkar U, Trikalinos TA, Dredze M. A large-scale quantitative analysis of latent factors and sentiment in online doctor reviews. J Am Med Inform Assoc 2014;21(6):1098-1103. [doi: 10.1136/amiajnl-2014-002711] [Medline: 24918109]
18. Rivas R, Montazeri N, Le NX, Hristidis V. Automatic classification of online doctor reviews: evaluation of text classifier algorithms. J Med Internet Res 2018 Nov 12;20(11):e11141. [doi: 10.2196/11141] [Medline: 30425030]
19. Wartena C, Sander U, Patzelt C. Sentiment independent topic detection in rated hospital reviews. In: Proceedings of the 13th International Conference on Computational Semantics - Short Papers. 2019 Presented at: IWCS '19; May 23-27, 2019; Gothenburg, Sweden p. 59-64. [doi: 10.18653/v1/w19-0509]
20. Li S, Lee-Won RJ, McKnight J. Effects of online physician reviews and physician gender on perceptions of physician skills and Primary Care Physician (PCP) selection. Health Commun 2019 Oct;34(11):1250-1258. [doi: 10.1080/10410236.2018.1475192] [Medline: 29792519]
21. Physician Data Center Query. Federation of State Medical Boards. 2018. URL: https://www.fsmb.org/PDC/ [accessed 2022-08-26]
22. Manning CD, Schutze H. Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press; 1999.
23. Lewis DD, Yang Y, Rose TG, Li F. Rcv1: a new benchmark collection for text categorization research. J Mach Learn Res 2004;5(Apr):361-397.
24. Rijsbergen CJ, Robertson SE, Porter MF. New Models in Probabilistic Information Retrieval. Vol. 5587. London, UK: British Library Research and Development Department; 1980.
25. Porter MF. An algorithm for suffix stripping. Program 1980;14(3):130-137. [doi: 10.1108/eb046814]
26. Le Q, Mikolov T. Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning. 2014 Presented at: PMLR '14; June 21-26, 2014; Beijing, China p. 1188-1196.
27. Mohammad SM, Turney PD. Crowdsourcing a word-emotion association lexicon. Comput Intell 2013 Aug;29(3):436-465. [doi: 10.1111/j.1467-8640.2012.00460.x]
28. Lau JH, Baldwin T. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv. Preprint posted online July 19, 2016. [doi: 10.48550/ARXIV.1607.05368]
29. MacNell L, Driscoll A, Hunt AN. What’s in a name: exposing gender bias in student ratings of teaching. Innov High Educ 2014 Dec 5;40(4):291-303. [doi: 10.1007/s10755-014-9313-4]
30. Castilla EJ. Gender, race, and meritocracy in organizational careers. AJS 2008 May;113(6):1479-1526. [doi: 10.1086/588738] [Medline: 19044141]
31. Peterson DA, Biederman LA, Andersen D, Ditonto TM, Roe K. Mitigating gender bias in student evaluations of teaching. PLoS One 2019 May 15;14(5):e0216241. [doi: 10.1371/journal.pone.0216241] [Medline: 31091292]
32. Uhlmann EL, Cohen GL. Constructed criteria: redefining merit to justify discrimination. Psychol Sci 2005 Jun;16(6):474-480. [doi: 10.1111/j.0956-7976.2005.01559.x] [Medline: 15943674]

Abbreviations

LDA: latent Dirichlet allocation
NRC: National Research Council Canada
OBGYN: obstetrics and gynecology

Edited by A Mavragani; submitted 17.11.21; peer-reviewed by R McGrath, J Fan, H Stevens; comments to author 24.01.22; revised version received 26.04.22; accepted 17.07.22; published 08.09.22

Please cite as:
Barnett J, Bjarnadóttir MV, Anderson D, Chen C
Understanding Gender Biases and Differences in Web-Based Reviews of Sanctioned Physicians Through a Machine Learning Approach: Mixed Methods Study
JMIR Form Res 2022;6(9):e34902
URL: https://formative.jmir.org/2022/9/e34902
doi: 10.2196/34902
PMID:

©Julia Barnett, Margrét Vilborg Bjarnadóttir, David Anderson, Chong Chen. Originally published in JMIR Formative Research (https://formative.jmir.org), 08.09.2022. This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.

Journal

JMIR Formative Research (JMIR Publications)

Published: Sep 8, 2022

Keywords: gender; natural language processing; web-based reviews; physician ratings by customer; text mining
