Human-likeness assessment for the Uncanny Valley Hypothesis

Introduction

The Uncanny Valley Hypothesis (hereafter UVH) was formulated by Masahiro Mori (see [1]). Mori hypothesises that when we present a subject with a series of human-like models (including robots), certain models will trigger negative reactions (uneasiness, eeriness); as he claims, these will be the almost human-like characters. We may imagine the models ordered on the X axis from least human-like (e.g. a robotic arm) to most human-like, with the level of affinity plotted on the Y axis. According to Mori, affinity grows as we move towards more human-like models, but at a certain level of human likeness it drops sharply; this is the 'valley' (or, as MacDorman and Kageki call it, the 'descent into eeriness' [1]).

The UVH has gained much attention in both research [for example, the Frontiers in Psychology journal alone reports the following impact for its UVH Research Topic over the last 12 months: 13,967 total views, 11,517 article views, 1567 article downloads; http://journal.frontiersin.org/researchtopic/2385/the-uncanny-valley-hypothesis-and-beyond#impact (accessed 24.03.2017)] and popular media [see e.g. the 2004 review "Review: 'Polar Express' a creepy ride. Technology brilliant, but where's the heart and soul?" by Paul Clinton for CNN.com (http://edition.cnn.com/2004/SHOWBIZ/Movies/11/10/review.polar.express/, accessed 24.03.2017) or Stephanie Lay, 'Uncanny valley: why we find human-like robots and dolls so creepy', The Guardian, 2015, https://www.theguardian.com/commentisfree/2015/nov/13/robots-human-uncanny-valley (accessed 24.03.2017)]. As [2] note, the reason is the applicability and importance of the hypothesis across a wide range of disciplines, from social robotics to computer animation (in the aforementioned review of 'The Polar Express' Paul Clinton writes: 'The overall artwork is remarkable, and the action sequences are inventive and emotionally gripping. [...] But those human characters in the film come across as downright... well, creepy. So "The Polar Express" is at best disconcerting, and at worst, a wee bit horrifying.') and game design (see e.g. [3] and the overviews in [4] and [5]).

Kätsyri et al. [2], in their exhaustive review of different theoretical models of the UVH and their up-to-date empirical support, note that 'it is surprising that empirical evidence for the UVH is still ambiguous if not non-existent'. Their main conclusion is the need for a deep analysis of the hypothesis itself, including an inspection of the human-likeness and affinity dimensions of the UVH. In particular, as [2] notice, Mori 'himself used anecdotal examples to characterize different degrees of human-likeness', using industrial robots, toy robots, prosthetic hands and even puppets, corpses and zombies. Such a variety of proposed models may trigger various levels of evaluating human likeness when it comes to empirical studies. As [2] stress, some of these models (when treated literally) may introduce extraneous factors into studies: puppets may be evaluated on an aesthetic level, while corpses and zombies may simply evoke strong negative emotions (see also the interesting discussion in [6]).

These observations lead us to our research problem. We wanted to check whether evaluating the degree of human likeness (DOH) of an arbitrarily chosen computer-generated model is unproblematic.
While reviewing the UVH-related literature, we may observe studies that simply use Mori's examples or types of models directly inspired by Mori's paper. Notably, in many cases UVH research settings simply take certain sets of models (photos, computer-generated graphics and videos) and rank them from least to most human-like in an arbitrary fashion. Often these models are also grouped into arbitrary categories. Let us consider the following examples. In ([7], p. 273) we read:

To create the stimuli we started by choosing seven 3D computer characters that included ones similar to those suggested by Mori: battle robot, toy robot, mannequin, skeleton, zombie, and low- and high-quality man [...]

Similarly, in ([8], p. 110) the authors explain how they chose the models for their research:

Considering the questions that must be covered, we tried to choose a not very large number of characters to avoid a too long form. Criteria for choosing the characters were set in order to obtain a group that allowed us to cover all the questions above. The first criterion was the human likeness of each character. This feature is necessary because it is present in the uncanny valley graph (the horizontal axis). [...] To cover the human likeness feature, we chosen characters intended to represent accurately a human [...]

What we want to verify in our studies is whether such a choice is really so simple and can be made authoritatively. We want to check how difficult DOH assessment is for a given set of models, and also how difficult it is to evaluate a given model as typical for a certain category, e.g. as a robot or an android.

The paper is structured as follows. In the next section we describe the two studies we designed: one in which subjects assessed the DOH of a set of computer-generated models, and a second in which subjects were asked to point out the most characteristic model for a given category. We then report the results of these studies and discuss them, together with the potential consequences for designing UVH studies and future research.

Methods

Both studies were conducted as on-line questionnaires using the Google Forms tool. The data were collected from December 2016 to January 2017. The language of both questionnaires was Polish. Subjects were recruited via e-mail invitation (invitations for the first and second study were sent to separate groups, so that participants from the first study would not take part in the second one). Thirty participants took part in the first study (15 women and 15 men; mean age 26 years; 63% of subjects declared secondary (high school) education and 36% declared higher education). The group characteristics for the second study resemble those for Study 1: 51 participants took part (28 women, 23 men; mean age 26 years; education: 43% secondary and 57% higher).

For our research we prepared 15 computer-rendered models. The models were retrieved from 3D character banks (see http://tf3dm.com and https://www.mixamo.com/) and rendered in the Unity environment (see https://unity3d.com/). All the models were chosen arbitrarily by the authors and then consulted with two designers experienced in game development. The models were grouped into five categories: robot, android, zombie, animated character and human. We intended the categories to reflect the growing human likeness of the models. The key features for dividing models into categories were visible facial features, the level of detail of the model (e.g. visible hands and fingers) and the overall style of the model. For example, models classified as robots do not have visible eyes, and their hands and feet are not detailed; the joints of their body parts are clearly visible, which gives them a very mechanical appearance.

All models were presented facing front, on a uniform grey-and-white background, and were scaled to the same size. The set of models used in the research is presented in Figure 1 (models are grouped in threes according to their assigned category; model numbers reflect their order of appearance in Study 1).

Figure 1: Computer-generated models used for the research. Models are grouped in threes according to their assigned category (from top left: robot, android, zombie, animated character and human). Model numbers reflect their order of appearance in Study 1.

The models were embedded into two studies.

Study 1

As mentioned, the study was conducted on-line using Google Forms. There were no time constraints on filling in the questionnaire. At the beginning of the questionnaire, information about the study and the instructions were displayed. In the main part of the study, each model appears on the screen separately, one by one, in random order. After a model is presented, the subject is asked to assess its degree of human likeness on a scale of 1–5, where (1) = Completely not human-like; (2) = Rather not human-like; (3) = It starts to look human-like; (4) = Rather human-like; and (5) = Completely human-like. The questionnaire ended with a short section asking for the subject's age and education. [An anonymous reviewer for this journal suggested that it would be beneficial for the study to also control the type of education (technical/non-technical) and interest in computer games. We agree that these may be important factors influencing the results of our study and plan to include such questions in our future studies, as we discuss in the summary.]

(H1) Our hypothesis was that we would obtain uniform results allowing for an unproblematic ordering of our arbitrarily chosen models from the least to the most human-like character; that is, results that would allow the human-likeness (X) axis for UVH research to be constituted.

(H2) Moreover, we expected that the results would reflect our choice of categories for the model set.

Study 2

For the second study we also used an on-line questionnaire in Google Forms, with no time constraints on filling it in. After the introductory section covering information about the study and the instructions, subjects were presented with the models in groups of three, reflecting the pre-established categories of robots, androids, zombies, animated characters and humans. The groups were presented one by one in random order. After each group, the subject was asked to decide which model is the most typical representative of the group; the name of each group was clearly visible. The questionnaire ended with a short section asking for the subject's age and education.

(H3) Our hypothesis was that the task in Study 1 would be more challenging for a subject than the one in Study 2. In Study 1, a subject has to evaluate a given model without any point of reference (only her/his own idea of human likeness).
In Study 2, by contrast, a subject chooses one of three models presented for a category known to her/him; thus, the decision should be simpler.

Results

R statistical software ([9]; version 3.3.1) was used for the data analysis.

Study 1

The results of the first study are presented in Table 1 and Figures 2 and 3. Let us start with Figure 2. Ordering the models according to their median human-likeness assessment shows that they do not form a clear least-human-like to most-human-like pattern, because certain models are located at the same level. See e.g. models 2, 3, 7, 10 and 11, all assigned by our subjects to the 'It starts to look human-like' category. As is visible in Figure 1, these characters have distinct visual features: M2 is a simplistic robot in a single (red) colour with no facial features, M7 is a robot with more sophisticated features including a human-like face, M3 is a simple animated character with a cartoonish appearance, and M10 is a zombie. We would expect such different models to be assessed differently.

Table 1: Human-likeness assessment for Study 1.

Model | Score (median) | Minimum score | Maximum score | Range | SD | IQR
M2 (robot) | 3 | 1 | 4 | 3 | 0.76 | 1
M13 (robot) | 1 | 1 | 3 | 2 | 0.73 | 1
M14 (robot) | 2.5 | 1 | 4 | 3 | 0.68 | 1
M5 (android) | 3 | 1 | 4 | 3 | 0.73 | 1
M7 (android) | 3 | 1 | 4 | 3 | 0.76 | 0.75
M15 (android) | 1 | 1 | 5 | 4 | 0.98 | 1
M8 (animated character) | 4 | 3 | 5 | 2 | 0.67 | 0
M3 (animated character) | 3.5 | 1 | 5 | 4 | 0.96 | 1
M6 (animated character) | 4 | 1 | 5 | 4 | 0.90 | 1
M10 (zombie) | 3 | 1 | 4 | 3 | 1.01 | 1.75
M4 (zombie) | 3.5 | 1 | 5 | 4 | 1.03 | 1
M11 (zombie) | 3 | 1 | 4 | 3 | 1.03 | 1
M9 (human) | 5 | 3 | 5 | 2 | 0.50 | 0
M1 (human) | 4 | 3 | 5 | 2 | 0.67 | 1
M12 (human) | 5 | 3 | 5 | 2 | 0.73 | 1

Figure 2: Study 1. Ordering of models according to the median of human likeness.

Figure 3: Study 1. Assessment of human likeness of the presented models with the data spread visible. Models with uniform assessment are clearly visible (M8, animated character, and M9, human), as is model M10 (zombie), which obtained a wide range of evaluations.

The lowest model on the human-likeness scale ('Completely not human-like') is M13, a very simplistic robot (one colour, no hands or feet visible, no human facial features). However, it received the same marks as M15, which is much more detailed (e.g. it has visible hands with fingers).

The most human-like models ('Completely human-like') pointed out by our subjects were M9 (woman) and M12 (man). Models 6, 8 and 1 were assigned to the 'Rather human-like' category.

Three models do not fit simply into our human-likeness categories: model 14 is located between the 'Rather not human-like' and 'It starts to look human-like' categories, and models 4 and 3 are between 'It starts to look human-like' and 'Rather human-like'.

Apart from analysing the median scores obtained by the models, we also decided to take a closer look at the data spread. The aim was to check how uniform the human-likeness assessment was. As we gathered ordinal data in Study 1, we use the standard deviation (SD) and the inter-quartile range (IQR) as measures of the data spread. We also check the range and the number of answer categories used by subjects for a given model.

IQR analysis of the study sample identifies three models which differ from the others: M8, M9 and M10. Only two models from the study sample gathered uniform assessments: M8 (animated character, IQR=0, SD=0.67) and M9 (human, IQR=0, SD=0.50); these are also clearly visible in Figure 2.
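As an aside for readers who wish to reproduce this kind of per-model summary, the sketch below shows how the median, SD, IQR and range reported in Table 1 can be computed in R, the software used for our analysis. It is only an illustration under assumed conventions: the data frame `ratings`, its columns `model` and `score`, the helper `summarise_model` and the scores themselves are hypothetical placeholders, not our actual data layout.

```r
# Minimal sketch (hypothetical data): one row per (subject, model) rating,
# with scores on the 1-5 human-likeness scale used in Study 1.
ratings <- data.frame(
  model = rep(c("M8", "M9", "M10"), each = 4),       # hypothetical subset of models
  score = c(4, 4, 4, 5,  5, 5, 5, 4,  1, 3, 4, 4)    # hypothetical ratings
)

# Per-model spread measures analogous to Table 1:
# median, standard deviation, inter-quartile range and range.
summarise_model <- function(x) {
  c(median = median(x),
    sd     = round(sd(x), 2),
    IQR    = IQR(x),
    range  = diff(range(x)))
}

aggregate(score ~ model, data = ratings, FUN = summarise_model)
```

With the full data set the same call would simply be run over all 15 models and 30 subjects.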
The range measure for these models is only 2, and in both cases the lowest assigned category is 3 ('It starts to look human-like').

For the other models in the study sample we may observe that the assessment was more difficult for subjects, which is reflected in the spread of the data. The most difficult group for subjects were the zombies. In fact, M10 has the most dispersed marks in the whole study sample (IQR=1.75 and SD=1.01), and the other zombie models, M4 and M11, also have high SD and IQR values (SD=1.03 and IQR=1 for each). Such a result may be the effect not only of a categorisation problem for these models but also of an emotional trigger related to the zombie appearance (see [2]).

The results for the animated characters group are also interesting. The distinct model is the aforementioned M8, for which the assessment was uniform. The other models from this group, M3 and M6, presented a more complicated problem for the subjects (M3 with SD=0.96, IQR=1 and M6 with SD=0.90, IQR=1). The range for these models is 4, and both received the lowest score ('Completely not human-like') at least once. This visible difference between M8 on the one hand and M3 and M6 on the other may be explained by the different styles of these models: M3 and M6 are more cartoonish, whereas M8 is more detailed and explicit, with human-like features and proportions.

In this respect, model M15 is also distinct from the other models in the android group. It shows a data spread similar to that of the animated characters just discussed (SD=0.98, IQR=1, range=4). The other models from this group were slightly easier to categorise than M15: model M5 (SD=0.73, IQR=1, range=3) and model M7 (SD=0.75, IQR=0.75, range=3) are similar to the models in the robot category.

The differences in the data spread within our pre-established groups point to another problem; see Figure 4, which presents the human-likeness evaluation of the models grouped in threes (from the left: robots, androids, zombies, animated characters and humans). Our groups (which seemed to us uniform and intuitively clear) overlap with each other. The assumption that we can use models of different types and that these types will line up on a human-likeness scale appears not to be so straightforward. Let us recall that in Study 1 subjects were not informed about our pre-established categories; they were simply presented with the models one by one.

Figure 4: Study 1. Human-likeness assessment for models presented in groups of three (from the left: robots, androids, zombies, animated characters and humans).

Study 2

In the second study we asked our subjects to decide which model is the most typical representative of the presented group (robots, androids, zombies, animated characters and humans). As might be expected, the choices made by subjects were not uniform.

The least problematic category appears to be robots: 88% of subjects pointed to M13 (M2 was chosen by 2% and M14 by 10% of subjects). This result reflects our intuitions described in the 'Methods' section: M13 has a very robotic appearance, with visible joints, no distinct facial features and no hands or feet. One more observation is in order here: the choice of M13 is in line with the results of Study 1, where M13 is the least human-like model of the study sample. It is, however, difficult to explain the differences in the assessment of M2 and M14, as they display similar visual characteristics and their DOH scores were at a similar level.

For the android category the majority of subjects (76%) pointed to M7, while M5 and M15 obtained 12% of the choices each.
The results for M5 and M15 are especially interesting, as these models are visually quite different: according to our initial intuitions, M5 is more robotic than M15. In the light of the DOH scores, the result obtained by M15 is also puzzling: according to Study 1, this model is more similar to M13 when we consider human likeness.

For the animated characters group we also observe a rather strong choice: M3 was pointed out by 84% of subjects (10% for M8 and 6% for M6). In this case our intuitions were in line with those displayed by our subjects: M3 has the overall visual appearance of an animated character (e.g. its body proportions, which give it a cartoonish look).

For the zombie category the results are more dispersed: 51% of the choices went to M11, 35% to M4 and 14% to M10. This suggests that the choice was more difficult than for the previously discussed groups. A similar observation may be made for the human group, where M9 gained 54% of the subjects' choices while M1 and M12 were chosen by 23% each. The more dispersed results for these groups may be due to the less distinct visual differences between the models used (as compared with the robots, androids and animated characters). In the light of the DOH scores, the identical result for M1 and M12 is also interesting: recall that M12 and M9 obtained the same DOH score, while M1 was situated at the same level as the animated characters M6 and M8.

Since the data gathered in Study 2 are categorical, we also decided to employ Fleiss' kappa to check the level of agreement on the task. Fleiss' kappa was computed in R using the irr package ([10]; version 0.84). The overall agreement for the whole study sample (i.e. five categories, 51 raters per category) is 0.22. Such a result is interpreted as slight or fair agreement (see [11]) and is regarded as a lack of adequate agreement over the discussed data set [12] (see also the illustrative sketch below).

Summary and discussion

The results presented in the previous section do not support our hypotheses for this research. In our opinion, the negative results for (H1) and (H2) are especially important. We started with an arbitrary set of computer-rendered models and grouped them into five categories which (in our view) reflected growing human likeness, expecting that the set could be used in UVH research (constituting the X axis). The results suggest instead that our models were assessed differently by different subjects. This supports the claims presented by [2] that human-likeness assessment may be biased by many external factors, and it leads to the conclusion that an analysis of the uniformity of human-likeness assessment should be an important element of UVH research.

Moreover, our pre-established categories were not reflected in the assessments either, which also suggests that certain external factors influence subjects' judgements. This issue is also visible in Study 2.

Study 2 was designed with the aim of providing a simpler task for our subjects. The idea was that the outcomes of pointing out the most typical representative of a given category (without human-likeness assessment) would be useful for ordering the models from our study set. The results obtained suggest, however, that our subjects were far from agreement even on this task. What is more, the results suggest that gathering additional explanations from our subjects would be necessary in future studies of the discussed type.
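As a side note, the sketch below illustrates how the Fleiss' kappa agreement check reported above for Study 2 can be run with the irr package. It is only an illustrative sketch under assumed conventions: the matrix `choices` and the values in it are invented placeholders; the real input would be a 5 x 51 arrangement (the five category groups as rated items, the 51 subjects as raters), with each cell holding the model chosen as most typical for that group.

```r
# Minimal sketch (hypothetical data): agreement on "most typical model" choices.
# Rows = the five category groups (items), columns = raters (subjects);
# each cell holds the model a rater chose as most typical for that group.
# The real Study 2 matrix would be 5 x 51; only 4 raters are shown here.
library(irr)  # interrater reliability package [10], provides kappam.fleiss()

choices <- matrix(
  c("M13", "M13", "M13", "M14",   # robot group
    "M7",  "M7",  "M5",  "M7",    # android group
    "M11", "M4",  "M11", "M10",   # zombie group
    "M3",  "M3",  "M3",  "M8",    # animated character group
    "M9",  "M12", "M9",  "M1"),   # human group
  nrow = 5, byrow = TRUE,
  dimnames = list(c("robot", "android", "zombie", "animated", "human"), NULL)
)

# Overall Fleiss' kappa across all five groups; with the full 5 x 51 matrix
# this is the quantity reported above (kappa = 0.22).
kappam.fleiss(choices)
```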
Such additional explanations from subjects would allow us to identify the factors responsible for model classification and to confront them with our intuitions.

In our future research we plan to modify our questionnaires in order to control more of the variables which may potentially influence DOH scores. More information about subjects will be collected (e.g. their interest in computer games), along with the factors suggested by [2], including emotional reactions to the models and the self-assessed difficulty of the human-likeness evaluation.

Author contributions: The authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

Research funding: None declared.

Employment or leadership: None declared.

Honorarium: None declared.

Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.

References

1. Mori M, MacDorman KF, Kageki N. The uncanny valley [from the field]. IEEE Robot Autom Mag 2012;19:98–100. (Original work published in 1970 in Japanese.) doi: 10.1109/MRA.2012.2192811.

2. Kätsyri J, Förger K, Mäkäräinen M, Takala T. A review of empirical evidence on different uncanny valley hypotheses: support for perceptual mismatch as one road to the valley of eeriness. Front Psychol 2015;6:390.

3. Ueyama Y. A Bayesian model of the uncanny valley effect for explaining the effects of therapeutic robots in autism spectrum disorder. PLoS One 2015;10:e0138642. doi: 10.1371/journal.pone.0138642.

4. MacDorman KF, Green RD, Ho C-C, Koch CT. Too real for comfort? Uncanny responses to computer generated faces. Comput Hum Behav 2009;25:695–710. doi: 10.1016/j.chb.2008.12.026.

5. Geller T. Overcoming the uncanny valley. IEEE Comput Graph Appl 2008;28:11–7. doi: 10.1109/MCG.2008.79.

6. Gee F, Browne WN, Kawamura K. Uncanny valley revisited. In: Wilkes M, editor. Robot and Human Interactive Communication, ROMAN 2005. IEEE International Workshop, Nashville, TN, USA, 2005:151–7. doi: 10.1109/ROMAN.2005.1513772.

7. Piwek L, McKay LS, Pollick FE. Empirical evaluation of the uncanny valley hypothesis fails to confirm the predicted effect of motion. Cognition 2014;130:271–7. doi: 10.1016/j.cognition.2013.11.001.

8. Dill V, Flach LM, Hocevar R, Lykawka C, Musse SR, Pinho MS. Evaluation of the Uncanny Valley in CG characters. In: Yukiko N, Neff M, Paiva A, Walker M, editors. International Conference on Intelligent Virtual Agents. Berlin: Springer, 2012:511–3.

9. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. Accessed: 20 Mar 2017.

10. Gamer M, Lemon J, Singh IF. irr: Various coefficients of interrater reliability and agreement. R package version 0.84, 2012. Accessed: 20 Mar 2017.

11. Viera AJ, Garrett JM. Understanding interobserver agreement: the kappa statistic. Fam Med 2005;37:360–3.

12. Hallgren KA. Computing inter-rater reliability for observational data: an overview and tutorial. Tutor Quant Methods Psychol 2012;8:23–34. doi: 10.20982/tqmp.08.1.p023.


Publisher: de Gruyter
Copyright: ©2017 Walter de Gruyter GmbH, Berlin/Boston
ISSN: 1896-530X
eISSN: 1896-530X
DOI: 10.1515/bams-2017-0008

Journal: Bio-Algorithms and Med-Systems (de Gruyter)
Published: Sep 26, 2017
