Multiple-Perspective Clustering of Passive Wi-Fi Sensing Trajectory Data

Zann Koh, Yuren Zhou*, Member, IEEE, Billy Pik Lik Lau, Student Member, IEEE, Chau Yuen, Senior Member, IEEE, Bige Tunçer, and Keng Hua Chong

*Corresponding author.
Zann Koh, Yuren Zhou, Billy Pik Lik Lau, and Chau Yuen are with the Pillar of Engineering and Product Development, Singapore University of Technology and Design, Singapore. (e-mail: yuren_zhou@sutd.edu.sg)
Bige Tunçer and Keng Hua Chong are with the Pillar of Architecture and Sustainable Design, Singapore University of Technology and Design, Singapore.
arXiv:2012.11796v1 [cs.LG] 22 Dec 2020

Abstract: Information about the spatiotemporal flow of humans within an urban context has a wide plethora of applications. Currently, although there are many different approaches to collect such data, there lacks a standardized framework to analyze it. The focus of this paper is on the analysis of the data collected through passive Wi-Fi sensing, as such passively collected data can have a wide coverage at low cost. We propose a systematic approach by using unsupervised machine learning methods, namely k-means clustering and hierarchical agglomerative clustering (HAC) to analyze data collected through such a passive Wi-Fi sniffing method. We examine three aspects of clustering of the data, namely by time, by person, and by location, and we present the results obtained by applying our proposed approach on a real-world dataset collected over five months.

Index Terms: passive Wi-Fi sensing, Wi-Fi sniffing, data mining, spatiotemporal, clustering.

I. INTRODUCTION

Understanding the spatiotemporal flow of humans within urban areas is a useful undertaking as it has applications in fields such as urban planning [1], [2], crowd monitoring [3], and targeted advertising [4]. Such tracking of human flow can come under two categories: active and passive. Active tracking involves obtaining data from participants who are actively engaged in the data collection process. One difficulty present in the use of such active data collection methods is that many countries have policies put into place to protect the privacy of users, which makes such data unobtainable from apps without users' consent. On top of these policies, it is difficult to obtain such consent and active participation from a large number of people.

In contrast to active tracking, passive tracking collects information about people without the need for active participation. Under the category of passive tracking, some currently used methods include cellular activity tracking [5]–[8], and the use of cameras [9]–[11]. However, these methods may have certain drawbacks under certain conditions. For instance, the resolution of cellular methods depends on the density of cell towers in a region, which is usually in the scale of kilometers and would be too coarse to measure micro-mobility of humans. Camera-based tracking is usually limited to sparse crowds [9] and a small coverage area [10]. As limitations based on line-of-sight will be inherent in all camera-based tracking methods, we will look into using a different method of tracking human micro-mobility to avoid this. Alternative methods of tracking passively include the passive sniffing of Bluetooth and Wi-Fi signals from mobile phones.

Passive sniffing of Bluetooth signals has been used in contexts such as detecting human behavior in shopping mall environments [12] as well as in public spaces such as the Louvre [13]. However, Bluetooth devices mainly operate in the range of 10 meters or so. This upper limit causes passive Bluetooth sensing to be less feasible for coverage of a large area such as a residential estate, as it would theoretically require the placement and maintenance of a large number of sensors throughout the course of the study. With Wi-Fi signals being detectable at a comparatively larger range of tens of meters, passive sensing of Wi-Fi signals was thus chosen as the method to focus on for this paper.

Within the available literature which makes use of passively detected Wi-Fi signals to make mobility-related inferences, there are some common methods of analysis. Most of them present the counts of detected devices over time, as well as heatmaps superimposed over maps of the region of detection. Some works [14]–[16] also have illustrations of the strength of each direct connection between sensors as a form of analysis of flow between each sensor.

However, previous literature on the use of Wi-Fi signals in mobility tracking lacks the in-depth use of machine learning methods to examine passively collected Wi-Fi data in detail. The main challenges and research gap are as follows:

• Lack of labels for Wi-Fi data - as passively collected Wi-Fi data is collected without the subjects' knowledge, there is no way to verify the accuracy of each person's time and location data with the person themselves.
• Lack of systematic approach in previous literature - in the previous literature, there is a lack of a standardized systematic approach to address such passively collected Wi-Fi data.
• Noise present in Wi-Fi data - there will be noise present in real-world data when collecting it, therefore we have to find some way to clean it in order to extract useful insights.

In this paper, different from previous literature, we use clustering, a type of unsupervised machine learning technique, to analyze a large real-world dataset and thus obtain more insights than those obtained through the above methods alone.
Clustering allows us to find groups of similar patterns within unlabeled data. As will be shown in the following sections, when applying clustering to Wi-Fi sensing data, one can obtain detailed multi-angle patterns related to people's behaviors in a certain residential or commercial district, which can be very helpful for district design and management. Examples include the clustering by time in Section IV, where the results can be used to check the similarities of people flow in different buildings across different days. This can help to inform the planning decisions of large-scale events in different buildings to avoid congestion, or to organize a roadshow during certain periods of time to capture a large number of passersby. The results of clustering by person as described in Section V can break down the population into a few groups with distinct behaviors, which can possibly have applications in marketing. Lastly, clustering by location as described in Section VI can help local authorities or urban planners detect whether the facilities within a residential estate are being utilized as planned, as well as track common routes of human travel within an estate for applications in future residential estate projects.

Therefore, this paper has the following contributions:

• We propose a systematic approach which applies unsupervised machine learning techniques, namely k-means clustering and hierarchical agglomerative clustering (HAC), to cluster a dataset gathered via the low-cost means of passive Wi-Fi sensing.
• We propose to analyze the data in three aspects - by time, by person, and by location.
• Finally, we apply the proposed approach on a real-world dataset collected over a period of five months and covering an area of approximately 0.52 km², and provide a detailed analysis of the clustering results.

The structure of the remaining sections will be as follows: Section II will discuss related works in detail. Section III will briefly go through the steps taken for data collection and preprocessing of the obtained data. Sections IV to VI will describe and present the results of clustering the data in three aspects - by time, by person, and by location respectively. Finally, Section VII will conclude the paper.

II. RELATED WORKS

This section gives an overview of the related works divided into the subsections of mobility tracking methods and clustering methods.

A. Tracking Methods

Methods of active tracking of human mobility actively involve participants whose data is being tracked. Examples can be found in [17], which used specific GPS sensing devices in tandem with interviews and questionnaires; in [18], where the mobility and demographic data of participants were collected through the use of a mobile application; as well as in [19], where geotagged data from a social media app was used. However, the consent of participants is difficult to obtain, especially due to privacy concerns and legislation. This is evidenced by [20], in which participants were asked through letters to run an app for seven days which would log their GPS location, and they were required to input locations where they stayed for over half an hour. That experiment had a 1.33% participation rate (45 individuals consented out of 3380 requests sent out). In addition, due to high reliance on individual users' compliance in self-reporting, data obtained through active methods may have some intrinsic bias and is likely to be scarcer than desired. Thus, for this study, we have chosen to use a passive tracking method.

Other than active tracking methods, there are also methods to passively track human mobility, such as cellular data tracking using cell towers [5]–[8], the use of cameras [9]–[11], and passive sensing of mobile phone signals such as Bluetooth and Wi-Fi.

While cell towers are readily present in the infrastructure of the country and the user base for collection of data is large, the data collected is difficult to access as it requires going through the service providers. Cell tower data also has a granularity in the range of kilometers, which is too coarse to examine the micro-mobility of humans. As for camera-based tracking, it is mainly limited to sparse crowds [9] and a small coverage area [10]. Compounded with the inability to estimate the current distribution of people flowing into and out of each segment of the video [11], it is less feasible for use in our current study, which aims to study the micro-mobility of people within a residential estate and the nearby set of buildings, which would theoretically require a considerably large number of cameras to cover the entire area.

We then turn to passive sensing of mobile phone signals. When making a decision between the use of Wi-Fi signals as compared to Bluetooth, we consider that it is more likely for a given device to have its Wi-Fi on as compared to its Bluetooth [21]. Sensing of Wi-Fi signals has been used in previous literature and works for finer resolution as compared to cell towers, as detection of packets is limited to the radius of Wi-Fi detection, which is typically less than 100 meters. Additionally, the authors in [22] have shown that it is possible to infer a large proportion of mobility of a population from a time-series of Wi-Fi signals coupled with a small number of GPS data samples. This supports the use of passively collected Wi-Fi signals as a viable method of location detection over time.

In the previous literature, there have been a few works in which the authors used sensors that were mobile, for example mobile phones held by volunteers [23], or laptops held by researchers themselves [24]. The use of these sensors is manpower intensive and therefore becomes less feasible for data collection of daily mobility over a long period of time. For the remaining majority of works that have used sensing of Wi-Fi signals, the sensors used mostly consist of Wi-Fi access points or self-implemented infrastructures, which are stationary over the course of the data collection. This means that they do not require volunteer participation, and thus are ideal for passive people counting and tracking.

Out of these papers that use stationary Wi-Fi sensors for their tracking, some mainly collect data for specific events rather than for daily human movement [25]–[28]; some are directed toward specific industries' applications such as investigating the movements within a hospital [29], within a university [30], within a shopping mall [31], [32], within a public district with shopping malls and office buildings [33], and among popular tourist locations on a set of islands [14]; and a few cover large enough parts of cities such as Manhattan and Copenhagen to identify the daily movement of commuters between work and home [15], [16], [22]. Although the scopes of these papers vary widely, they have certain similarities in the analysis of the data. Most of them present the counts of detected devices over time, as well as heatmaps superimposed over maps of the region of detection. Some works [14]–[16] also have illustrations of the strength of each direct connection between sensors as a form of analysis of flow between each sensor. However, these previous works all lack the systematic use of machine learning methods to gather deeper insights into their mobility data. Therefore, this paper will propose the use of a machine learning method, clustering, to discover common mobility patterns from the data we have collected.

B. Clustering Methods

Within the field of machine learning, there are two main types: supervised learning, which requires labeled data, and unsupervised learning, which does not. In this case, as the data we have collected lacks ground truth, we will be using an unsupervised machine learning method, more specifically clustering. Clustering methods find groups in the input data based on input parameters and a distance measure.

As our data size is large, we would require the algorithms we use to be scalable. Some algorithms are not scalable, such as Mean Shift clustering [34], which requires multiple nearest neighbor searches during the execution of the algorithm, and Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) [35], which does not scale well to high dimensional data. Thus, these algorithms are less ideal for this study. Other clustering algorithms such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [36] and Ordering Points to Identify the Clustering Structure (OPTICS) [37] are computationally more complex and require more complicated, potentially iterative parameter selection. Additionally, as we hope to potentially discover unusual mobility patterns, it would be more difficult to tune DBSCAN's parameters to obtain explainable and meaningful clusters, as compared to k-means.

The k-means clustering algorithm [38], on the other hand, has the advantages of being simple to implement, with a single intuitive choice of parameter (number of clusters, k), as well as being scalable to large datasets. The k-means clustering algorithm has been extensively studied in cases with outliers as well, and it has been shown in works such as those by Im et al. [39] and Bhaskara et al. [40] that even in datasets with noise, a high-quality clustering can be obtained after removal of outliers, while using the k-means algorithm. Other literature on using k-means on a noisy dataset can also be found in the references [41]–[45]. In addition to k-means, hierarchical agglomerative clustering (HAC) [46] is also a possible choice of algorithm, as it is also scalable to large datasets and has an intuitive visualization in terms of the dendrogram. Users are able to easily view the distance between each level of agglomerated clusters and decide on an appropriate distance threshold. Thus, for this paper, we will be using the k-means clustering algorithm as well as the HAC algorithm for the purposes of clustering.

III. DATA COLLECTION AND PREPROCESSING

A. Data Collection

Passive Wi-Fi sensors were deployed at selected locations of interest in two areas of the city: a Facility area and a Residential area. The Facility area consists of a hospital building, an educational institution, and four shopping mall buildings containing offices. For ease of explanation, the shopping mall buildings will be referred to as Malls 1 to 4, respectively. Malls 3 and 4 are located near a transportation hub, which is hypothesised to be the main means of transport to and from the Facility area from other places outside. All of the buildings in the Facility area are connected with an elevated walkway, and our sensors were installed at the entrances to the buildings from the walkways. The Residential area has 38 blocks in total, and our sensors were deployed at 14 selected locations among these buildings. The distance between these 2 areas is in the range of a few hundred metres, and this Facility area would be the nearest commercial center to the Residential area. Sketches of the different buildings in each area can be found in Fig. 8 for a clear understanding.

The sensors passively collect Wi-Fi probe packets sent from nearby mobile devices. Each sensor is a box containing a Wi-Fi sniffer, which is built on top of the Raspberry Pi Model B with an additional WI-PI USB dongle for probe request collection. Local processing of probe requests is performed to reduce the volume of uploaded data, in turn reducing costs of data transmission and storage. This local processing involves combining probe requests with a chronological separation of 3 minutes or less [47]. The processed data is then uploaded to the cloud via cellular connection. A flowchart illustrating the above process can be found in Fig. 1.

Fig. 1: Overall framework for understanding spatiotemporal human flow through passive Wi-Fi sensing and mining of collected data.

B. Preprocessing

The collected data has to be further processed using the method in our earlier work [47] to obtain the trajectory of a particular mobile device with a unique MAC address. The trajectories obtained using the above method are sensor level trajectories. For this study, we will do some merging of the sensor level trajectories to obtain building level trajectories, which will be explained later. After the trajectories are obtained, due to the noisy nature of such data collected using Wi-Fi probes, further filtering based on heuristics is performed as described later on in this section.

A sensor level trajectory is represented by (x_1, x_2, ..., x_n), in which x_i = (macAddress, sensorID_i, nextSensor_i, startTime_i, endTime_i, stayTime_i, takeTime_i). As each device's trajectory is grouped together by MAC address, the value for macAddress for each x_i in the same trajectory will be the same. No other information about devices was retained, which keeps the privacy of device owners from being compromised as we cannot use the MAC address alone to identify a specific individual. sensorID_i and nextSensor_i belong to the set of all sensors used in this study, {A1, A2, B1, B2, ...}. The first character of each sensor name refers to the building where the sensor was placed, while the remaining digits serve to differentiate different sensors placed at the same building. For example, A1 and A2 represent two different sensors, both placed at the building labeled 'A'. In the code, sensorID_i is a string variable, so the building is represented by the first character of the string, sensorID_i[0]. For example, if sensorID_i is 'A1', sensorID_i[0] would be 'A'. sensorID_i represents the ID of the ith sensor in the trajectory, while nextSensor_i represents the ID of the (i+1)th sensor. For the last element in the trajectory, x_n, nextSensor_n will be left empty. startTime_i and endTime_i represent the start time and end time of detection at sensor i respectively. stayTime_i (time spent at the current sensor) and takeTime_i (time taken to reach the next sensor after leaving the current sensor) are calculated as in Eqn. (1) and (2) respectively:

$$\text{stayTime}_i = \text{endTime}_i - \text{startTime}_i \tag{1}$$

$$\text{takeTime}_i = \begin{cases} \text{startTime}_{i+1} - \text{endTime}_i & \text{if } i = 1, \dots, n-1 \\ 0 & \text{if } i = n \end{cases} \tag{2}$$

Each trajectory lasts for the length of one day. Days were taken from 3:00 AM on one calendar day to 3:00 AM the next, to give allowance for some commercial or social activities that may cut across midnight.

For the purpose of this study, we consider trajectories at building level instead of sensor level, as each individual building has its own purpose (such as shopping mall, hospital etc.) and thus any results we obtain could be explained more intuitively as compared to individual sensor levels. To get the building level trajectories from sensor level trajectories, consecutive entries detected at the same building in the same trajectory were merged if they occurred within 6 hours of each other on the same day according to Algorithm 1. Two sensors are from the same building if the first character in the sensor name is the same, such as A1 and A2.

Algorithm 1 Merging of sensor level trajectory
Input: Traj_input: original sensor level trajectory
Output: Traj_new: merged building level trajectory
  n ← length(Traj_input)
  Traj_new ← []
  for i = 1, ..., n do    ▷ Change sensor names to buildings
    sensorID_i ← sensorID_i[0]
    nextSensor_i ← nextSensor_i[0]
  end for
  for i = 1, ..., n do
    if takeTime_i < 21600 and nextSensor_i = sensorID_i then
      macAddress_new ← macAddress_i
      sensorID_new ← sensorID_i
      nextSensor_new ← nextSensor_{i+1}
      startTime_new ← startTime_i
      endTime_new ← endTime_{i+1}
      stayTime_new ← (stayTime_i + takeTime_i + stayTime_{i+1})
      takeTime_new ← takeTime_{i+1}
      x_new ← (macAddress_new, sensorID_new, nextSensor_new, startTime_new, endTime_new, stayTime_new, takeTime_new)
    else
      x_new ← x_i
    end if
    Traj_new ← [Traj_new; x_new]
  end for

An example of merging a trajectory is shown in Fig. 2. It can be seen that the detections at Buildings A and C were merged, but the detections at Building B were not, due to exceeding the time threshold.

Fig. 2: An example of merging certain consecutive entries along a trajectory.

Other than merging, all trajectories with endTime_n − startTime_1 (n being the trajectory length) less than 5 minutes were discarded as these short trajectories are not informative and are therefore outside the scope of our investigation. Long trajectories indicating that the device stayed at a single location for more than 16 hours in a single day were similarly discarded as they are deemed to be anomalies. After this filtering, our dataset consisted of more than 5.7 million trajectories from around 1.6 million devices over the period of 5 months.

IV. CLUSTERING BY TIME

This section explores the first aspect of trajectory data clustering, which is clustering by time. This section will be split into two parts: the results of the clustering of calendar days according to each day's features at each location, as well as some further analysis of highlighted clustering results.

A. Clustering Results

Clustering by time involves the extraction of features into a vector representing each day and clustering those vectors. Each day is segmented into 24 intervals of 1 hour each, starting from 3:00 AM on one day to 3:00 AM the next day. The number of people appearing at each location within that hour was extracted from the data, and this forms a 1-by-24 feature vector for each day. Each vector was normalized using min-max normalization before being subjected to k-means clustering. The aim of this clustering is to investigate the possibility of inferring temporal context such as type of day in terms of weekday as compared to weekend based on the patterns of people count at each location in the Facility area. The k-means algorithm is chosen based on its scalability to large datasets and simplicity in choice of a single parameter.

The k-means algorithm uses an expectation-maximization style procedure to perform clustering as described in Eqn. (3) and (4) below. Firstly, k centroids $C_j$, where j = 1, ..., k, are initialized within the feature space of the data set. Second, each data point is temporarily assigned an index equal to the index of the centroid nearest to it. The index of the ith data point $x_i$ out of the total number of N data points is referred to as $\text{ind}_i$ ($\text{ind}_i \in \{1, \dots, k\}$) and is computed using Eqn. (3). The distance metric commonly used is Euclidean distance. Next, the centroid of each cluster is reassigned to the arithmetic mean of all the data points in that cluster as in Eqn. (4), where $n_j$ refers to the number of data points in cluster j. The last two steps are then repeated until convergence, which occurs when the calculated means of the clusters do not change in subsequent steps.

$$\text{ind}_i := \arg\min_{j} \lVert x_i - C_j \rVert \tag{3}$$

$$C_j := \frac{1}{n_j} \sum_{i:\, \text{ind}_i = j} x_i \tag{4}$$

This is a fast and simple method to cluster data, and it works well for data with features of a similar type - in this case, all features are numbers of people detected at a location within a certain hour. To find the best value of k to set as a parameter for this clustering, we plotted the sum-of-squared error plot as shown in Fig. 3, which indicates that the ideal value for k, located at the elbow point of the curve, is between 3 and 5. After testing out these values for k, a value of k = 4 turned out to give the best balance between specificity and interpretability, as a value of 3 could be broken down further whereas the value of 5 broke down the clusters in an unintuitive way. From the results, it also largely corresponds to 4 main types of days with their own characteristics - working weekdays, Fridays, Saturdays and Sundays.

Fig. 3: Plot of the sum-of-squared error values for different values of k.

Fig. 4: Representation of results of clustering by time. Weeks 1, 2, 16, and 21 are selected for clarity. Weeks 1 and 2 show public holidays (PH) that are grouped into the same cluster as Sundays, while weeks 16 and 21 are selected as an example of 'normal weeks' without PH or special days.

Fig. 4 shows a plot of the results of clustering by day in calendar form. In each calendar, the column represents one day of the week from Monday through Sunday, while each row represents the week number of the data, with week 1 being the first week of data collection, and so on.

Weeks 1 and 2 were selected to show the cluster assignment of public holidays (PH) and PH eves. The PH in weeks 1 and 2 are the squares with red borders. Weeks 16 and 21 were chosen to show the cluster assignment of a "normal" week, that is, without special days like PH or PH eves.

From the calendars of Malls 1 to 4 and the Hospital, it can be seen that the clustering generally follows the four types of days below with overall at least 70% of instances falling into each respective cluster: Cluster 1 (dark blue) mainly has working Mondays to Thursdays, Cluster 2 (brown) mainly contains Fridays, Cluster 3 (yellow) mainly contains Saturdays, and Cluster 4 (light blue) mainly contains Sundays. As seen from Table I(a) to I(f), 100% of the available PH in the dataset are clustered in the same cluster as Sundays (Cluster 4), while PH eves are clustered in the same cluster as Fridays (Cluster 2) in a proportion of 50% and above for five out of the six buildings.

TABLE I: Percentage of actual (row) vs clustered (column) days in each separate building

(a) Hospital
            Mon-Thur     Fri/PH Eve        Sat        Sun/PH
Mon-Thur      51.4      48.6     0.0       0.0      0.0     0.0
Fri           16.7      83.3     0.0       0.0      0.0     0.0
PH Eve       100.0       0.0     0.0       0.0      0.0     0.0
Sat            0.0       0.0     0.0     100.0      0.0     0.0
Sun            0.0       0.0     0.0       0.0    100.0     0.0
PH             0.0       0.0     0.0       0.0      0.0   100.0

(b) Institute
            Mon-Thur     Fri/PH Eve        Sat        Sun/PH
Mon-Thur      51.4      42.9     0.0       2.9      2.9     0.0
Fri           35.0      65.0     0.0       0.0      0.0     0.0
PH Eve         0.0       0.0   100.0       0.0      0.0     0.0
Sat           11.1       0.0     0.0      72.2     16.7     0.0
Sun            0.0       0.0     0.0      21.4     78.6     0.0
PH             0.0       0.0     0.0       0.0      0.0   100.0

(c) Mall 1
            Mon-Thur     Fri/PH Eve        Sat        Sun/PH
Mon-Thur      71.8      26.9     0.0       0.0      1.3     0.0
Fri            4.8      95.2     0.0       0.0      0.0     0.0
PH Eve         0.0       0.0   100.0       0.0      0.0     0.0
Sat            0.0       0.0     0.0      76.2     23.8     0.0
Sun            0.0       0.0     0.0      11.1     88.9     0.0
PH             0.0       0.0     0.0       0.0      0.0   100.0

(d) Mall 2
            Mon-Thur     Fri/PH Eve        Sat        Sun/PH
Mon-Thur      94.7       4.0     0.0       1.3      0.0     0.0
Fri            0.0     100.0     0.0       0.0      0.0     0.0
PH Eve        50.0       0.0    50.0       0.0      0.0     0.0
Sat            0.0       0.0     0.0     100.0      0.0     0.0
Sun            0.0       0.0     0.0      16.7     83.3     0.0
PH             0.0       0.0     0.0       0.0      0.0   100.0

(e) Mall 3
            Mon-Thur     Fri/PH Eve        Sat        Sun/PH
Mon-Thur      79.5      20.5     0.0       0.0      0.0     0.0
Fri            9.5      90.5     0.0       0.0      0.0     0.0
PH Eve        50.0       0.0    50.0       0.0      0.0     0.0
Sat            0.0       0.0     0.0      95.2      4.8     0.0
Sun            0.0       0.0     0.0       0.0    100.0     0.0
PH             0.0       0.0     0.0       0.0      0.0   100.0

(f) Mall 4
            Mon-Thur     Fri/PH Eve        Sat        Sun/PH
Mon-Thur      83.1      16.9     0.0       0.0      0.0     0.0
Fri            5.3      94.7     0.0       0.0      0.0     0.0
PH Eve        50.0       0.0    50.0       0.0      0.0     0.0
Sat            0.0       0.0     0.0      90.0     10.0     0.0
Sun            0.0       0.0     0.0      17.6     82.4     0.0
PH             0.0       0.0     0.0       0.0      0.0   100.0

Now that we know that different types of days follow generally different patterns, the actual patterns of each day are then investigated in the next subsection.

B. Further Analysis of Clustering Results

To examine how the actual patterns of people count vary in different clusters, an example of people count curves on different clusters of days from Mall 2 was visualised in Fig. 5. The colors correspond to the cluster assignments in Fig. 4.

Fig. 5: Curves of device count versus time at Mall 2 on different days, separated based on the clustering results. The four clusters are mainly corresponding to 1) Working Mondays to Thursdays, 2) Fridays/Public holiday eves, 3) Saturdays, and 4) Sundays/PH. The x-axis of each plot represents the time of day while the y-axis represents normalized device count.

Upon first glance, the curves for Clusters 1 (Mondays to Thursdays) and 2 (Fridays and PH Eves) are very similar to each other and very different from Clusters 3 (Saturdays) and 4 (Sundays and PH), which are also similar to each other.

Clusters 1 and 2 each have three peaks of people count, which tend to occur between 6:00 AM and 9:00 AM, between 11:00 AM and 2:00 PM, and between 5:00 PM and 8:00 PM. These periods of time roughly correspond to office hours and meal times, so these could show the surges of people arriving at the mall to have meals within those time periods. The last peak that occurs in the evening could also represent the evening shoppers. However, the first peak of people count is lower for Cluster 2 than Cluster 1, and the last peak is much higher. There could be an increase in people going to the mall in the evenings on Friday/PH eve to shop as compared to working Mondays to Thursdays.

For Clusters 3 and 4, the curves are generally smooth, increasing sharply in the late morning and plateauing across the middle of the day before dropping back down to zero after 10:00 PM. Cluster 3's curve gently increases from 1:00 PM to 7:00 PM, reaching its highest point at 7:00 PM, while Cluster 4's curve lacks a noticeable increase in the evening. This could be because people are more likely to stay out later in the evenings on Saturdays as they do not need to go to work the following morning, compared to Sundays when they do.

V. CLUSTERING BY PERSON

This section explores the second aspect of trajectory data clustering, which is clustering by person. This section will be split into two parts: the results of the clustering of individual trajectories, and the temporal analysis of highlighted clustering results based on the analysis of the above section.

A. Clustering Results

Individual trajectories within a single day had their features extracted into a single vector. The features used for clustering were the accumulated time detected in a single day at the hospital, shopping mall, educational institute, the residential estate (day time), and the residential estate (night time) respectively. The detected time at the residential estate was split into 7:00 AM to 7:00 PM (day time) and 7:00 PM to 7:00 AM (night time). This split was performed as many retired elderly were observed to be staying at the estate and they are hypothesized to be more active in the day time, as opposed to the working adults or school children, who are likely to be more active in the residential estate later in the day or evening. These features were then used as input for clustering via the k-means clustering algorithm.

A value for k has to be predetermined for use in k-means. Traditional indices for calculating the optimal number of clusters for k-means clustering, such as the silhouette index [48] and sum-of-squared error plot, suggest an ideal value of 5. However, when trying to use this value, the sizes of the clusters are greatly skewed. We therefore manually increase k to find a suitable value for clustering and end up with a value of 8. When the value of k was originally selected as 5, the people who stayed for a longer time at the shopping malls were grouped together to form a very large cluster. After the value of k was increased to 8, this large cluster was further divided into 3 smaller clusters, each representing one more specific type of person. Similarly, the people who stayed the longest time at the hospital were originally grouped together in a large cluster in the k = 5 case and this large cluster was divided into 2 smaller clusters after k was increased to 8. As a result, k = 8 is selected over 5 because it produces a more balanced and informative clustering result. This value was used to perform k-means clustering on the feature vectors. The aim of this portion is to search for clusters of device trajectories and, from there, infer insights about a given device based on the cluster that its trajectory is assigned to. The results of the clustering and each part of our proposed analysis are shown in Fig. 6. Each cluster is labeled as 'CP', which stands for Cluster of People, together with its corresponding number.

Row (a) in Fig. 6 contains the visualizations of the trajectories of the individual clusters. The colors represent the different types of buildings as stated in the legend. Row (b) contains the plots of transition probabilities between each pair of nodes within each cluster, in order to study the movement patterns. There are 20 nodes in total, with 6 from the Facility area and 14 individual buildings from the Residential area, labeled alphabetically from 'a' to 'n'. Row (c) contains the histograms showing the distribution of number of unique nodes visited per trajectory in each cluster. Finally, Row (d) contains the probability distributions of start and end times of trajectories in each cluster.

From row (a), these eight clusters can be broken down visually into four groups of similar clusters - CP1 to CP4 being mainly at the shopping malls (gray), CP5 and CP6 being mainly at the hospital (purple), CP7 being mainly at the Institute (pink), and CP8 being mainly at the Residential area (blue). Each cluster will be discussed further in detail.

CP1 to CP4 are the clusters in which most, if not all, trajectories have visited a shopping mall at least once. The amount of time stayed at a shopping mall per trajectory increases in order from CP1 (less than 1 hour) to CP4 (7.5 to 15 hours). The transition probability diagrams show heavy emphasis on the buildings in the Facility area (top right). The most common transitions are those between Malls 2 and 3, followed by the transitions between Mall 3 and the Hospital, as well as between Malls 2 and 4. All four of these clusters also show a similar pattern in terms of number of nodes visited per trajectory peaking at two; however, in CP4 there is a much higher probability of a trajectory containing a single building as compared to the other three clusters. The main difference in these clusters lies in the distributions of start and end times. CP1 has three peaks for the start time (8:00 AM, 12:00 PM, 6:00 PM), while CP2 and CP3 have two peaks (12:00 PM and 6:00 PM for CP2, 8:00 AM and 12:00 PM for CP3), and CP4 has only one (8:00 AM). These common timings correspond with the common times to start work (8:00 AM), break for lunch (12:00 PM), and leave work for home (6:00 PM). Those people who appear in the shopping mall at 6:00 PM and stay there for at least an hour (CP2) could be going for short shopping trips after work. Those who appear at 8:00 AM may either pass by the shopping malls on their way to work (CP1) or start working at the shopping mall in the morning (CP3 and CP4). However, since the time spent ranges from

Fig. 6: Illustrations of results from clustering data by person. (a) A visual representation of the eight clusters obtained using k-means clustering of trajectory stay times at differing locations. The horizontal axis represents the time of day from 3:00 AM to 3:00 AM the next day, while the color represents the location where the device was detected, as described in the legend. Trajectories are sorted by the time at which they were first detected.
(b) Visualizations of transition probability between each pair of nodes within each cluster, for comparing the movement patterns between clusters. A darker and thicker line connecting a pair of nodes means that there are more transitions between those nodes. (c) Histograms describing the number of unique locations visited per trajectory within each cluster. The y-axis has been subjected to probability normalization. (d) Probability distributions of start times and end times of trajectories in each cluster. The x-axis represents the time of day and the y-axis represents the probability that a trajectory in the respective cluster starts or ends within that timing with a resolution of one hour.

between 3.5 and 7.5 hours (CP3) as compared to between 7.5 and 15 hours (CP4), it can be inferred that trajectories in CP4 are more likely to belong to people working in the shopping malls for long hours, while those in CP3 could belong to either people with shorter shifts or people on long shopping trips.

CP5 and CP6 are the clusters in which all the trajectories have visited the Hospital and stayed for a relatively long time (2 to 6.5 hours for CP5 and 7.5 to 15 hours for CP6). The transition probability diagrams both show strong links between Mall 3 and the Hospital, as well as between Mall 4 and the Hospital. Since Malls 3 and 4 are connected to a transport hub, this high number of transitions could reflect people taking public transport to and from the Facility area. The distributions of the number of nodes visited per trajectory are also similar, both peaking at two places and decreasing with increasing number of places up to five or six. The main difference between these two clusters lies in the distributions of the start and end times. Both clusters have peaks in start time at 8:00 AM and 12:00 PM; however, the 8:00 AM peak for CP6 is much higher and the 12:00 PM peak is much lower as compared to CP5. For the ending times, both clusters have peaks at 5:00-6:00 PM and 9:00 PM, but the 6:00 PM peak is much higher for CP6 than for CP5. This indicates that the people in CP6 start and end their trajectories over a much narrower time period than those in CP5, which is in line with the consideration that people in CP6 may be workers at the Hospital, while people in CP5 may be either shift workers or visitors to the Hospital.

CP7 is the cluster that contains trajectories with the longest time at the Institute, from 3.5 to 14.5 hours. Its corresponding transition probability diagram is noticeably different from the others, with strong links between the Hospital and the Institute, as well as between Mall 3 and the Hospital. These strong links may also be explained as above, due to the proximity of Mall 3 to the transport hub. Most of the trajectories in this cluster contained three unique nodes, which is different from the rest as well, since the rest of the unique node distributions all peaked at two. The start and end time distributions are also unique, in that there is a single prominent sharp spike in each of the start and end times. The start time peak occurs at 8:00 AM with a probability value of almost 0.5, while the end time peak occurs at 6:00 PM with a value between 0.3 and 0.4. This indicates that CP7 is likely to represent people who work or study at the Institute.

Lastly, CP8 is the cluster that contains trajectories with the longest time in the Residential area (0.5 to 11 hours in the day and 12 hours at night). Its corresponding transition probability diagram shows visually stronger links between Residential buildings than between Facility buildings, in contrast to CP1 to CP7, which have very few links to the Residential buildings. There are also a few links between the Residential and Facility areas, but these are considerably fewer than in the other clusters as well. The probability of each number of unique locations per trajectory appears to decrease exponentially as the number of places per trajectory increases from two up to ten, as compared to other clusters, which appear to be more of a parabola shape. One thing to note for the start and end time distribution plot is that the probability values at 3:00 AM and 2:00 AM are much higher than those of other clusters. These may reflect the trajectories of residents who stay at the Residential area overnight, past the cutoff time of 3:00 AM when the day changes.

B. Temporal Analysis of By-Person Clustering Results

As part of a deeper analysis of the results of clustering trajectory data by person, we explored the temporal aspect of the clustered data by extracting the temporal data of each CP, similar to the extraction in Section III. We then manually grouped the vectors based on the 4 types of days, namely working Mondays to Thursdays, Fridays/PH eve, Saturdays, and Sundays/PH. We then plotted the average of each group, as well as the minimum and maximum boundaries, which are shown by the shaded area surrounding each line. Below, we highlight two results of our temporal analysis.

Fig. 7: Daily average number of detected devices from (a) CP5 and (b) CP6 recorded at the Shopping Malls. The shaded area represents the maximum and minimum bounds for each type of day.

Fig. 7 depicts the results from CP5 and CP6, the top two clusters of people recorded as having the longest stay time at the Hospital, as detected at the Shopping Malls. From the proximity of locations, it can be inferred that most of these detections would be of people traveling through the Shopping Malls towards the Hospital, as the extremely long stay time at the Hospital for these clusters' trajectories indicates a high likelihood of the trajectories having the Hospital as their main destination for the day. Upon first glance, it can be seen that for both cases, the Mon-Thur and Friday/PH Eve lines are very close to each other, and much higher than the Saturday and Sunday/PH lines. The Mon-Thur and Friday/PH Eve lines in both cases have three large peaks as well at 7:00-8:00 AM, 12:00 PM, and 5:00-6:00 PM, and one small peak at 9:00 PM. However, the 12:00 PM peak for CP5 is the highest peak, while the 5:00-6:00 PM peak is the highest for CP6.

For CP5, the Saturday line peaks in the morning and at 12:00 PM, before decreasing gently throughout the rest of the day to nearly zero at 11:00 PM. In contrast, the Sunday/PH line for CP5 lacks distinctive peaks, instead gently curving upwards and then decreasing in a similar way to the Saturday line after 4:00 PM. On the other hand, the Saturday and Sunday/PH lines for CP6 are relatively similar and both have four peaks of approximately equal height spread throughout the day. These peaks occur at 6:00 AM, 12:00 PM, 4:00 PM, and 9:00 PM. The difference in the weekend behavior for these CPs, in addition to the length of time spent at the Hospital for each cluster, supports the line of thinking that CP5 is more likely to represent visitors to the Hospital who go shopping afterwards, while CP6 is likely to represent the people who are employed at the Hospital.

Fig. 8: Daily average number of detected devices from CP8 recorded at the Residential area. Selected lines show comparisons between (a) Mon-Thur and Saturday, (b) Mon-Thur and Sunday/PH. The shaded area represents the maximum and minimum bounds for each type of day.

Fig. 8 depicts the results from CP8, the cluster of people recorded as having the longest stay time at the Residential area, as detected in the Residential area. Due to the length of stay at the Residential area, it is likely that they are residents of that area or have that area as a main destination for the day. Since the numbers of detected devices are very similar overall for the four different types of days, two types of days were chosen in each subfigure to give a clear comparison. Fig. 8(a) shows the comparison between working Mondays to Thursdays and Saturdays, while Fig. 8(b) shows the comparison between working Mondays to Thursdays and Sundays/PH.

For Fig. 8(a), it can be seen that more people are detected in the evenings on Saturdays as compared to working Mondays to Thursdays. This can reflect that more people from the Residential area stay out late on Saturday nights as compared to working weeknights. This makes sense if people tend to go home early when there is work the next day, as compared to Saturdays when there is a much lower probability of people going to work on Sundays/PH.

Fig. 8(b) shows that there are more devices detected before 10:00 AM on working Mondays to Thursdays as compared to on Sundays/PH, but there are more devices detected after 10:00 AM on Sundays/PH than on working Mondays to Thursdays. This could indicate that the residents generally wake up or leave the house later in the mornings on Sundays/PH than on working Mondays to Thursdays.

VI. CLUSTERING BY LOCATION

This section addresses the final aspect of clustering addressed in this study, which is clustering by location. The first part of this section presents the results of clustering of the transition probability matrices illustrating transition probabilities between pairs of buildings, while the second part provides a further examination into the transition patterns extracted from the data.

A. Clustering Results

After clustering the data by temporal patterns as well as by individual patterns, the third aspect of trajectory data clustering is to look at the spatial patterns. The number of detected transitions between each pair of nodes is extracted and compiled into a transition probability matrix, where each row denotes the probability of people moving out of the corresponding source node, and each column denotes the probability of people moving towards the corresponding destination node. For this matrix, we considered 20 locations: 6 from the Facility area as well as 14 individual buildings from the Residential area. The transition probability for each entry of the matrix was calculated using the equation below:

$$T(i, j) = \frac{N(i, j)}{\sum_{k \in [1, 20],\, k \neq i} N(i, k)} \tag{5}$$

where $T(i, j)$ refers to the entry of the transition probability matrix in row $i$ and column $j$ for $i \neq j$, and $N(i, j)$ refers to the number of transitions observed in the data moving from node $i$ to node $j$ for $i \neq j$. The diagonal entries of the matrix, indicating the probability of each node going back to itself, were then set to 1 to fill up the matrix. As the entries in the input matrix are distances rather than coordinates in a feature space, the use of k-means clustering is less suitable. Thus, in this case, the transition probability matrix was then subject to HAC as described in [46].

HAC is an algorithm designed to cluster data points together based on a given distance matrix. Many different types of linkages can be used, such as average linkage, single linkage, Ward linkage, and so on. The result of HAC can be visualized in the form of a dendrogram, which has all the nodes at the bottom as separate 'leaves', which are joined together pairwise by 'branches', until all the clusters are joined together at the very top. The rows and columns of the transition probability matrix would then be rearranged simultaneously according to the output dendrogram. In this case, Ward's method [49] is used to calculate the linkages between different clusters and data points at each level. Ward's method is an objective function approach involving the pairing of clusters at each step that results in the minimum increase in the total within-cluster variance after merging. The total within-cluster variance (WCV) is shown in Eqn. 6, where $x_i$ represents data point $i$, $j$ represents the cluster number, $\mathrm{ind}_i$ is the index of the cluster that point $i$ belongs to, and $\mu_j$ represents the mean of all the points within cluster $j$. The initial cluster distances are defined as the squared Euclidean distance between points. We hypothesize that the buildings will form roughly spherical clusters based on the map, and Ward's linkage is suitable for this.

$$\mathrm{WCV} = \sum_{j} \sum_{i:\, \mathrm{ind}_i = j} \lVert x_i - \mu_j \rVert^2 \tag{6}$$

The above process was done separately for three different time periods of a day for the whole dataset. The first selected time period was between 6:00 AM and 10:00 AM, which is the time when people generally have breakfast, leave their houses, or arrive at their workplace. The second selected time period was between 11:00 AM and 2:00 PM, which is the time when people generally have lunch, and thus there could be more prominent movement around the malls. Lastly, the third selected time period was between 6:00 PM and 10:00 PM, which is generally the time when people working office hours leave work, have dinner, or return home. The results of HAC by location for each time period are shown in Fig. 9. From a general overview, it can be easily seen that the clustering results differ for each time of the day.

Fig. 9: Diagrams showing different clustering arrangements of locations over the course of the day. (a) Dendrograms produced from HAC, using the transition probability matrices in (b) as input. (c) Corresponding map of locations in the Facility area. (d) Corresponding map of locations in the Residential area. Buildings in grey do not have sensors installed. Buildings that changed from a different grouping in the previous time period have bolded outlines, e.g. afternoon different from morning.

In Fig. 9(a), the dendrograms show that the groupings of similar buildings change with different timings of the day. The first six labels of each dendrogram represent the Malls, Institute and Hospital. They are grouped into pairs in the morning and afternoon, while they are grouped in threes in the evening. The yellow and purple pairs also change members between morning and afternoon. For a better illustration of the groupings with respect to the building map, a simplified version is shown in Fig. 9(c). The remaining 14 labels of each dendrogram represent the buildings in the Residential area. These also differ throughout the day, and an illustration can be seen in Fig. 9(d). The buildings that were from a different grouping in the previous time period have bolded outlines. The differences between the groupings of the Residential area are described in the following paragraph.

There is one cluster in the Residential area that stays the same throughout all three time periods, represented by the orange cluster. The buildings that this cluster corresponds to are located on one end of the Residential area and thus they may have similar transition probabilities that differ from the rest of the Residential area. The pink cluster stays the same through the morning and afternoon, but has an additional member in the evening that was originally part of the green cluster. The red cluster present in the morning was grouped with the dark blue cluster for the rest of the day.

In Fig. 9(b), the transition probability matrices offer more insight on the general flow of people around the area. One prominent observation is that although there is a visible probability that people from the Residential area move towards the Facility area, there is a very low probability of them moving in the opposite direction. A possible reason for this is that the people coming from the Residential area only contribute to a small percentage of the overall number of visitors to the Facility area, and as a result there is a much larger number of transitions between the Facility buildings as compared to the number of transitions moving from the Facility buildings toward the Residential buildings. This is largely consistent throughout the three time periods.
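As a rough illustration of the pipeline described in this section, the sketch below builds a transition probability matrix following Eq. (5) and then runs a naive agglomerative clustering in which each merge is the one that increases the total within-cluster variance of Eq. (6) the least, which is Ward's criterion. It is not the authors' implementation: the function names `transition_matrix` and `ward_hac`, the four-node `counts` table, and the choice to treat each row of the matrix as a feature vector are all assumptions made for this example.

```python
from itertools import combinations


def transition_matrix(counts):
    """Eq. (5): normalise each row over its off-diagonal counts; diagonal set to 1."""
    n = len(counts)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        total = sum(counts[i][k] for k in range(n) if k != i)
        for j in range(n):
            if i == j:
                T[i][j] = 1.0
            elif total > 0:
                T[i][j] = counts[i][j] / total
    return T


def ward_hac(points):
    """Naive Ward-linkage HAC; returns merges as (members_a, members_b, cost)."""
    # Each active cluster is stored as (member indices, centroid).
    clusters = {i: ([i], list(p)) for i, p in enumerate(points)}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            (ma, ca), (mb, cb) = clusters[a], clusters[b]
            sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
            # Increase in total within-cluster variance (Eq. 6) if a and b merge.
            cost = len(ma) * len(mb) / (len(ma) + len(mb)) * sq
            if best is None or cost < best[0]:
                best = (cost, a, b)
        cost, a, b = best
        (ma, ca), (mb, cb) = clusters.pop(a), clusters.pop(b)
        size = len(ma) + len(mb)
        centroid = [(len(ma) * x + len(mb) * y) / size for x, y in zip(ca, cb)]
        clusters[next_id] = (ma + mb, centroid)
        merges.append((sorted(ma), sorted(mb), cost))
        next_id += 1
    return merges


# Hypothetical counts N(i, j) for four nodes (illustration only, not paper data):
# nodes 0 and 1 exchange most of their traffic, as do nodes 2 and 3.
counts = [
    [0, 8, 1, 1],
    [9, 0, 1, 0],
    [1, 1, 0, 8],
    [0, 2, 8, 0],
]
T = transition_matrix(counts)
merges = ward_hac(T)  # assumption: rows of T are used as feature vectors
for members_a, members_b, cost in merges:
    print(members_a, "+", members_b, "cost = %.3f" % cost)
```

The merge list plays the role of the dendrogram: reading it from first to last merge gives the order in which leaves are joined. The pairwise search is cubic in the number of nodes, which is acceptable for 20 locations; for larger inputs an established library routine would be the more sensible choice.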
As this large discrepancy makes it difficult to directly identify dominant directions from Fig. 9(b), a clearer visualization of dominant directions will be provided later in Fig. 10.

Another observation is that the direction of the largest transition probabilities, represented by the darkest squares, changes over time in the Facility area. In the mornings, the main direction is from Malls 3 and 4 toward the Hospital, as well as from Mall 2 to Mall 4. For the afternoon, the transition probabilities between Malls 2 and 3 are observably higher than those in the morning, and they are higher still in the evening. There is also a larger transition probability in both directions between the Hospital and Mall 3, as well as a higher probability of transition from the Institute to the Hospital. For the evening, two of the main directions are reversed from the morning, coming from Mall 4 to Mall 2, and from the Hospital to Mall 3. This could mean that there is an outflow from the transportation hub to the rest of the Facility area in the morning, while the flow is opposite in the evening. This may in turn indicate that the bulk of these detected transitions come from people who work within the Facility area during office hours.

As for the Residential area, it can be seen from Fig. 9(b) that there is a noticeably high probability of arriving at building b from several other buildings, since the column of the transition probability matrices corresponding to building b has mostly darker squares than the rest. This trend shows minimal change throughout the time periods. However, there is a lower probability of transitions from the Residential buildings to the Facility buildings in the evening period as compared to the morning and afternoon periods.

B. Analysis of Transition Patterns

In order to have a deeper analysis of the spatial patterns, the dominant transition directions between each pair of buildings were plotted in Fig. 10. A direction of transition between a pair of buildings is considered dominant if it occurs with a probability higher than 0.55. This probability is calculated by taking the number of transitions in one direction and dividing by the sum of the number of transitions in both directions.

Fig. 10: A plot of dominant directions with probability larger than 0.55. Dominant directions are shown in black.

At first glance of Fig. 10, one observation that stands out is that in the morning, a large block of the dominant directions start from the Residential area and lead to the Facility area, while it is the reverse case in the evening. This agrees with a general understanding that humans will go out to work in the morning and return home in the evening.

The rest of the buildings' flows are analyzed through plotting the dominant directions on maps, focusing on the inflow and outflow of one building at a time, called the building of interest. Patterns can be more easily identified visually, and three such patterns are highlighted in Fig. 11. These include reversal of flow at different times of day, continuous flow towards a specific building, and continuous flow in a general direction.

Fig. 11: Highlighted spatial results of analysis of dominant flows over different time periods. (a) Reversal of flows. (b) Continuous outward flow to one specific building. (c) Continuous flow in a general direction.

Firstly, in Fig. 11(a), the building of interest is Mall 1, located in the Facility area. In the morning, the dominant flows are all inwards towards Mall 1, as shown by the red arrows. Subsequently, the number of dominant flows towards Mall 1 decreases in the afternoon, and finally, in the evening, all the dominant directions are flowing outwards from Mall 1, as shown by the blue arrows.

Next, Fig. 11(b) shows an example of continuous flow towards a specific building throughout the day. The building of interest in this figure is Building j, in the Residential area. At all three time periods, the dominant outward flow (indicated by the blue arrow) is always towards its neighboring building, Building k, in the Residential area.

Lastly, Fig. 11(c) shows an example of continuous flow in a general direction throughout the day. The building of interest is Building h, another building in the Residential area. Its inflows, shown by red arrows, tend to come from the left and lower parts of the map, while the outflows, shown by blue arrows, tend to go towards the top and right. This pattern appears in all different time periods of the day as well.

VII. CONCLUSION

In this paper, we proposed a systematic approach to analyze trajectory data obtained through passive Wi-Fi sensing. We used two unsupervised machine learning techniques, k-means clustering and HAC, to examine three different aspects of clustering of trajectory data, namely by time, by person, and by location. In doing so, we observed patterns of daily movement such as the fluctuation of people count over the course of the day, clusters of trajectories belonging to different types of people, as well as the relative volumes of flow between different buildings. We also provided various ways of visualization for the clustering results.

For future work, the proposed approach can be performed on datasets gathered from different locations, such as different residential estates and facilities, and a comparison can be done to identify differences such as the difference in trajectory patterns of people living in mature estates as compared to newer estates, or estates with different proximities to different sets of facilities. The findings could help in future work related to urban planning in the following ways. Clustering by time gives insights on the daily footfall in various buildings on each type of day; therefore, event planning could be more informed, or building tenant proportions could be adjusted if deemed to have an effect on the footfall. Clustering by person gives an idea of rough proportions of travel patterns of users living in an estate, and facilities within the estate can also be adjusted based on need. Finally, clustering by location can help in the planning of the land use in upcoming estates, depending on how the users flow from building to building.

ACKNOWLEDGMENT

This research is supported by the Singapore Ministry of National Development and the National Research Foundation, Prime Minister's Office under the Land and Liveability National Innovation Challenge (L2 NIC) Research Programme (L2 NIC Award No. L2NICTDF1-2017-4). Any opinion, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the Singapore Ministry of National Development and National Research Foundation, Prime Minister's Office, Singapore.

REFERENCES

[1] Y. Zheng, Y. Liu, J. Yuan, and X. Xie, “Urban computing with taxicabs,” in Proceedings of the 13th International Conference on Ubiquitous Computing. ACM, 2011, pp. 89–98.
[2] R. A. Becker, R. Caceres, K. Hanson, J. M. Loh, S. Urbanek, A. Varshavsky, and C. Volinsky, “A tale of one city: Using cellular network data for urban planning,” IEEE Pervasive Computing, vol. 10, no. 4, pp. 18–26, 2011.
[3] Y. Mowafi, A. Zmily, I. Dhiah el Diehn, and D. Abu-Saymeh, “Tracking human mobility at mass gathering events using WISP,” in Second International Conference on Future Generation Communication Technologies (FGCT 2013). IEEE, 2013, pp. 157–162.
[4] D. Zhang, L. Guo, L. Nie, J. Shao, S. Wu, and H. T. Shen, “Targeted advertising in public transportation systems with quantitative evaluation,” ACM Transactions on Information Systems (TOIS), vol. 35, no. 3, p. 20,
[5] M. C. Gonzalez, C. A. Hidalgo, and A.-L. Barabasi, “Understanding individual human mobility patterns,” Nature, vol. 453, no. 7196, p. 779,
[6] R. Becker, R. Cáceres, K. Hanson, S. Isaacman, J. M. Loh, M. Martonosi, J. Rowland, S. Urbanek, A. Varshavsky, and C. Volinsky, “Human mobility characterization from cellular network data,” Communications of the ACM, vol. 56, no. 1, pp. 74–82, 2013.
[7] S. Jiang, J. Ferreira, and M. C. Gonzalez, “Activity-based human mobility patterns inferred from mobile phone data: A case study of Singapore,” IEEE Transactions on Big Data, vol. 3, no. 2, pp. 208–219,
[8] M. Zhang, H. Fu, Y. Li, and S. Chen, “Understanding urban dynamics from massive mobile traffic data,” IEEE Transactions on Big Data, vol. 5, no. 2, pp. 266–278, 2017.
[9] Honglian Ma, Huchuan Lu, and Mingxiu Zhang, “A real-time effective system for tracking passing people using a single camera,” in 2008 7th World Congress on Intelligent Control and Automation, 2008, pp. 6173–
[10] R. Eshel and Y. Moses, “Tracking in a dense crowd using multiple cameras,” International Journal of Computer Vision, vol. 88, no. 1, pp. 129–143, 2010.
[11] A. V. Kurilkin, O. O. Vyatkina, S. A. Mityagin, and S. V. Ivanov, “Evaluation of urban mobility using surveillance cameras,” Procedia Computer Science, vol. 66, pp. 364–371, 2015.
[12] A. Galati and C. Greenhalgh, “Human mobility in shopping mall environments,” in Proceedings of the Second International Workshop on Mobile Opportunistic Networking, 2010, pp. 1–7.
[13] Y. Yoshimura, S. Sobolevsky, C. Ratti, F. Girardin, J. P. Carrascal, J. Blat, and R. Sinatra, “An analysis of visitors’ behavior in the Louvre museum: A study using Bluetooth data,” Environment and Planning B: Planning and Design, vol. 41, no. 6, pp. 1113–1131, 2014.
[14] N. Nunes, M. Ribeiro, C. Prandi, and V. Nisi, “Beanstalk: a community based passive wi-fi tracking system for analysing tourism dynamics,” in Proceedings of the ACM SIGCHI Symposium on Engineering Interactive Computing Systems. ACM, 2017, pp. 93–98.
[15] M. Traunmueller, N. Johnson, A. Malik, and C. E. Kontokosta, “Digital Traces: Modeling Urban Mobility using Wifi Probe Data,” in 6th International Workshop on Urban Computing, ACM KDD, 2017.
[16] M. W. Traunmueller, N. Johnson, A. Malik, and C. E. Kontokosta, “Digital footprints: Using WiFi probe and locational data to analyze human mobility trajectories in cities,” Computers, Environment and Urban Systems, vol. 72, pp. 4–12, 2018.
[17] S. Van der Spek, J. Van Schaick, P. De Bois, and R. De Haan, “Sensing human activity: GPS tracking,” Sensors, vol. 9, no. 4, pp. 3033–3055, 2009.
[18] S. H. Marakkalage, S. Sarica, B. P. L. Lau, S. K. Viswanath, T. Balasubramaniam, C. Yuen, B. Yuen, J. Luo, and R. Nayak, “Understanding the lifestyle of older population: Mobile crowdsensing approach,” IEEE Transactions on Computational Social Systems, vol. 6, no. 1, pp. 82–95,
[19] T. Hu, E. Bigelow, J. Luo, and H. Kautz, “Tales of two cities: Using social media to understand idiosyncratic lifestyles in distinctive metropolitan areas,” IEEE Transactions on Big Data, vol. 3, no. 1, pp. 55–66, 2016.
[20] N. M. Yip, R. Forrest, and S. Xian, “Exploring segregation and mobilities: Application of an activity tracking app on mobile phone,” Cities, vol. 59, pp. 156–163, 2016.
[21] L. Schauer, M. Werner, and P. Marcus, “Estimating crowd densities and pedestrian flows using wi-fi and bluetooth,” in Proceedings of the 11th International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, 2014, pp. 171–177.
[22] P. Sapiezynski, A. Stopczynski, R. Gatej, and S. Lehmann, “Tracking human mobility using wifi signals,” PloS one, vol. 10, no. 7, 2015.
[23] Y. Chon, S. Kim, S. Lee, D. Kim, Y. Kim, and H. Cha, “Sensing WiFi packets in the air: practicality and implications in urban mobility monitoring,” in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2014, pp. 189–200.
[24] M. V. Barbera, A. Epasto, A. Mei, V. C. Perta, and J. Stefa, “Signals from the crowd: uncovering social relationships through smartphone probes,” in Proceedings of the 2013 Internet Measurement Conference, 2013, pp. 265–276.
[25] A. Basalamah, “Crowd mobility analysis using wifi sniffers,” Int J Adv Comput Sci Appl, vol. 7, pp. 374–378, 2016.
[26] J. McAuley, C. Roux, and J. Little, “Towards Approaches and Techniques for Analysing WiFi Location Data,” The 25th Irish Conference on Artificial Intelligence and Cognitive Science, Dublin, 2017.
[27] A. Alessandrini, C. Gioia, F. Sermi, I. Sofos, D. Tarchi, and M. Vespe, “Wifi positioning and Big Data to monitor flows of people on a wide scale,” in 2017 European Navigation Conference (ENC). IEEE, 2017, pp. 322–328.
[28] Y. Zhou, B. P. L. Lau, Z. Koh, C. Yuen, and B. K. K. Ng, “Understanding Crowd Behaviors in a Social Event by Passive WiFi Sensing and Data Mining,” IEEE Internet of Things Journal, 2020.
[29] A. J. Ruiz-Ruiz, H. Blunck, T. S. Prentow, A. Stisen, and M. B. Kjærgaard, “Analysis methods for extracting knowledge from large-scale wifi monitoring to inform building facility planning,” in 2014 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 2014, pp. 130–138.
[30] E. Kalogianni, R. Sileryte, M. Lam, K. Zhou, M. Van der Ham, S. Van der Spek, and E. Verbree, “Passive wifi monitoring of the rhythm of the campus,” in Proceedings of The 18th AGILE International Conference on Geographic Information Science, 2015, pp. 9–14.
[31] J. Shen, J. Cao, X. Liu, and S. Tang, “SNOW: Detecting shopping groups using wifi,” IEEE Internet of Things Journal, vol. 5, no. 5, pp. 3908–3917, 2018.
[32] J. Shen, J. Cao, and X. Liu, “BaG: Behavior-aware group detection in crowded urban spaces using wifi probes,” IEEE Transactions on Mobile Computing, 2020.
[33] Y. Zhou, B. P. L. Lau, C. Yuen, B. Tunçer, and E. Wilhelm, “Understanding urban human mobility through crowdsensed data,” IEEE Communications Magazine, vol. 56, no. 11, pp. 52–59, 2018.
[34] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603–619, 2002.
[35] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: an efficient data clustering method for very large databases,” ACM SIGMOD Record, vol. 25, no. 2, pp. 103–114, 1996.
[36] M. Ester, H.-P. Kriegel, J. Sander, X. Xu et al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD, vol. 96, no. 34, 1996, pp. 226–231.
[37] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “OPTICS: Ordering points to identify the clustering structure,” ACM SIGMOD Record, vol. 28, no. 2, pp. 49–60, 1999.
[38] S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
[39] S. Im, M. M. Qaem, B. Moseley, X. Sun, and R. Zhou, “Fast noise removal for k-means clustering,” arXiv preprint arXiv:2003.02433,
[40] A. Bhaskara, S. Vadgama, and H. Xu, “Greedy sampling for approximate clustering in the presence of outliers,” in Advances in Neural Information Processing Systems, 2019, pp. 11 148–11 157.
[41] W. Tang and T. M. Khoshgoftaar, “Noise identification with the k-means algorithm,” in 16th IEEE International Conference on Tools with Artificial Intelligence. IEEE, 2004, pp. 373–378.
[42] S. Ben-David and N. Haghtalab, “Clustering in the presence of background noise,” in International Conference on Machine Learning, 2014, pp. 280–288.
[43] W. S. Manjoro, M. Dhakar, and B. K. Chaurasia, “Operational analysis of k-medoids and k-means algorithms on noisy data,” in 2016 International Conference on Communication and Signal Processing (ICCSP). IEEE, 2016, pp. 1500–1505.
[44] B. Schelling and C. Plant, “KMN – Removing noise from k-means clustering results,” in International Conference on Big Data Analytics and Knowledge Discovery. Springer, 2018, pp. 137–151.
[45] Z. He and C. Yu, “Clustering stability-based evolutionary k-means,” Soft Computing, vol. 23, no. 1, pp. 305–321, 2019.
[46] D. Müllner, “Modern hierarchical, agglomerative clustering algorithms,” arXiv preprint arXiv:1109.2378, 2011.
[47] K. Li, C. Yuen, S. S. Kanhere, K. Hu, W. Zhang, F. Jiang, and X. Liu, “An Experimental Study for Tracking Crowd in Smart Cities,” IEEE Systems Journal, 2018.
[48] P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[49] J. H. Ward Jr, “Hierarchical grouping to optimize an objective function,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 236–244, 1963.

Zann Koh received the B.Eng degree in Engineering and Product Development from the Singapore University of Technology and Design, Singapore, in 2017. She is currently pursuing the Ph.D. degree with the Singapore University of Technology and Design, Singapore, under Dr. Chau Yuen's supervision. Her current research interests include big data analysis, data discovery, urban human mobility, and unsupervised machine learning.

Yuren Zhou received the B.Eng. degree in Electrical Engineering from Harbin Institute of Technology, Harbin, China in 2014, and the Ph.D. degree from Singapore University of Technology and Design, Singapore in 2019, with a focus on data mining and smart city applications. He is currently a post-doctoral research fellow at Singapore University of Technology and Design. His current research interests include big data analytics and its application in urban human mobility, building energy management, and Internet of Things.

Billy Pik Lik Lau received the degree in computer science and M.Phil. degree in computer science from Curtin University, Perth, WA, Australia, in 2010 and 2014, respectively. He is currently a Ph.D. candidate with Dr.
Chau Yuen at the Singapore University of Technology and Design, Singapore. He studied the cooperation rate between agents in multiagents systems during master studies. His current research interests include urban science, big data analysis, data knowledge discovery, Internet of Things, and unsupervised machine learning. Chau Yuen is currently an Associate Professor at Singapore University of Technology and Design. He received the B.Eng. and Ph.D. degrees from Nanyang Technological University, Singapore, in 2000 and 2004, respectively. He was a Postdoctoral Fellow at Lucent Technologies Bell Labs, Murray Hill, NJ, USA, in 2005. He was a Visiting Assistant Professor at The Hong Kong Polytechnic University in 2008. From 2006 to 2010, he was a Senior Research Engineer at the Institute for Infocomm Research (I2R, Singapore), where he was involved in an industrial project on developing an 802.11n Wireless LAN system, and participated actively in 3Gpp Long Term Evolution (LTE) and LTE-Advanced (LTE-A) Standardization. He has been with the Singapore University of Technology and Design since 2010. He is a recipient of the Lee Kuan Yew Gold Medal, the Institution of Electrical Engineers Book Prize, the Institute of Engineering of Singapore Gold Medal, the Merck Sharp and Dohme Gold Medal, and twice the recipient of the Hewlett Packard Prize. He received the IEEE Asia-Pacific Outstanding Young Researcher Award in 2012. He serves as an Editor for the IEEE Transaction on Communications and the IEEE Transactions on Vehicular Technology and was awarded the Top Associate Editor from 2009 to 2015. Bige Tunc ¸er is an associate professor at the Ar- chitecture and Sustainable Design Pillar of Singa- pore University of Technology and Design (SUTD), where she founded the Informed Design Lab. The lab’s research focuses on data collection, information and knowledge modeling and visualization, for in- formed architectural and urban design. 
She received her PhD in Architecture from Delft University of Technology (TU Delft), her MSc (computational design) from Carnegie Mellon University, and her BArch from Middle East Technical University. She was an assistant professor at TU Delft, a visiting professor at ETH Zurich, a visiting scholar at MIT, and a visiting professor at Computer Engineering Department of University of Pavia, Italy. Her research interests include evidence based design, big data informed urban design, and design thinking. She leads and participates in various large multi-disciplinary research projects in evidence informed design, IoT, and big data. Keng Hua Chong is Associate Professor of Ar- chitecture and Sustainable Design at the Singa- pore University of Technology and Design (SUTD), where he directs the Social Urban Research Groupe (SURGe) and co-leads the Opportunity Lab (O-Lab). His research on social architecture particularly in the areas of ageing population, liveable place and data-driven collaborative design has led to several key publications and projects, including Creative Ageing Cities, Second Beginnings, and the New Urban Kampung Research Programme. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Statistics arXiv (Cornell University)

ISSN
2332-7790
eISSN
ARCH-3347
DOI
10.1109/TBDATA.2020.3045154
Multiple-Perspective Clustering of Passive Wi-Fi Sensing Trajectory Data

Zann Koh, Yuren Zhou*, Member, IEEE, Billy Pik Lik Lau, Student Member, IEEE, Chau Yuen, Senior Member, IEEE, Bige Tunçer, and Keng Hua Chong

*Corresponding author. Zann Koh, Yuren Zhou, Billy Pik Lik Lau, and Chau Yuen are with the Pillar of Engineering and Product Development, Singapore University of Technology and Design, Singapore. (e-mail: yuren zhou@sutd.edu.sg) Bige Tunçer and Keng Hua Chong are with the Pillar of Architecture and Sustainable Design, Singapore University of Technology and Design, Singapore.

arXiv:2012.11796v1 [cs.LG] 22 Dec 2020

Abstract—Information about the spatiotemporal flow of humans within an urban context has a wide plethora of applications. Currently, although there are many different approaches to collect such data, a standardized framework to analyze it is lacking. The focus of this paper is on the analysis of the data collected through passive Wi-Fi sensing, as such passively collected data can have a wide coverage at low cost. We propose a systematic approach by using unsupervised machine learning methods, namely k-means clustering and hierarchical agglomerative clustering (HAC), to analyze data collected through such a passive Wi-Fi sniffing method. We examine three aspects of clustering of the data, namely by time, by person, and by location, and we present the results obtained by applying our proposed approach on a real-world dataset collected over five months.

Index Terms—passive Wi-Fi sensing, Wi-Fi sniffing, data mining, spatiotemporal, clustering.

I. INTRODUCTION

Understanding the spatiotemporal flow of humans within urban areas is a useful undertaking as it has applications in fields such as urban planning [1], [2], crowd monitoring [3], and targeted advertising [4]. Such tracking of human flow can come under two categories: active and passive. Active tracking involves obtaining data from participants who are actively engaged in the data collection process. One difficulty present in the use of such active data collection methods is that many countries have policies put into place to protect the privacy of users, which makes such data unobtainable from apps without users’ consent. On top of these policies, it is difficult to obtain such consent and active participation from a large number of people.

In contrast to active tracking, passive tracking collects information about people without the need for active participation. Under the category of passive tracking, some currently used methods include cellular activity tracking [5]–[8], and the use of cameras [9]–[11]. However, these methods may have certain drawbacks under certain conditions. For instance, the resolution of cellular methods depends on the density of cell towers in a region, which is usually in the scale of kilometers and would be too coarse to measure micro-mobility of humans. Camera-based tracking is usually limited to sparse crowds [9] and a small coverage area [10]. As limitations based on line-of-sight will be inherent in all camera-based tracking methods, we will look into using a different method of tracking human micro-mobility to avoid this. Alternative methods of tracking passively include the passive sniffing of Bluetooth and Wi-Fi signals from mobile phones.

Passive sniffing of Bluetooth signals has been used in contexts such as detecting human behavior in shopping mall environments [12] as well as in public spaces such as the Louvre [13]. However, Bluetooth devices mainly operate in the range of 10 meters or so. This upper limit causes passive Bluetooth sensing to be less feasible for coverage of a large area such as a residential estate, as it would theoretically require the placement and maintenance of a large number of sensors throughout the course of the study. With Wi-Fi signals being detectable at a comparatively larger range of tens of meters, passive sensing of Wi-Fi signals was thus chosen as the method to focus on for this paper.

Within the available literature which makes use of passively detected Wi-Fi signals to make mobility-related inferences, there are some common methods of analysis. Most of them present the counts of detected devices over time, as well as heatmaps superimposed over maps of the region of detection. Some works [14]–[16] also have illustrations of the strength of each direct connection between sensors as a form of analysis of flow between each sensor.

However, previous literature on the use of Wi-Fi signals in mobility tracking lacks the in-depth use of machine learning methods to examine passively collected Wi-Fi data in detail. The main challenges and research gap are as follows:

• Lack of labels for Wi-Fi data - as passively collected Wi-Fi data is collected without the subjects’ knowledge, there is no way to verify the accuracy of each person’s time and location data with the person themselves.
• Lack of systematic approach in previous literature - in the previous literature, there is a lack of a standardized systematic approach to address such passively collected Wi-Fi data.
• Noise present in Wi-Fi data - there will be noise present in real-world data when collecting it, therefore we have to find some way to clean it in order to extract useful insights.

In this paper, different from previous literature, we use clustering, a type of unsupervised machine learning technique, to analyze a large real-world dataset and thus obtain more insights than those obtained through the above methods alone. Clustering allows us to find groups of similar patterns within unlabeled data. As will be shown in the following sections, when applying clustering to Wi-Fi sensing data, one can obtain detailed multi-angle patterns related to people’s behaviors in a certain residential or commercial district, which can be very helpful for district design and management. Examples include the clustering by time in Section IV, where the results can be used to check the similarities of people flow in different buildings across different days. This can help to inform the planning decisions of large-scale events in different buildings to avoid congestion, or to organize a roadshow during certain periods of time to capture a large number of passersby. The results of clustering by person as described in Section V can break down the population into a few groups with distinct behaviors, which can possibly have applications in marketing. Lastly, clustering by location as described in Section VI can help local authorities or urban planners detect whether the facilities within a residential estate are being utilized as planned, as well as track common routes of human travel within an estate for applications in future residential estate projects.

Therefore, this paper has the following contributions:

• We propose a systematic approach which applies the unsupervised machine learning techniques, namely k-means clustering and hierarchical agglomerative clustering (HAC), to cluster a dataset gathered via the low-cost means of passive Wi-Fi sensing.
• We propose to analyze the data in three aspects - by time, by person, and by location.
• Finally, we apply the proposed approach on a real-world dataset collected over a period of five months and covering an area of approximately 0.52 km², and provide a detailed analysis of the clustering results.

The structure of the remaining sections will be as follows: Section II will discuss related works in detail. Section III will briefly go through the steps taken for data collection and preprocessing of the obtained data. Sections IV to VI will describe and present the results of clustering the data in three aspects - by time, by person, and by location respectively. Finally, Section VII will conclude the paper.

II. RELATED WORKS

This section gives an overview of the related works divided into the subsections of mobility tracking methods and clustering methods.

A. Tracking Methods

Methods of active tracking of human mobility actively involve participants whose data is being tracked. Examples can be found in [17], which used specific GPS sensing devices in tandem with interviews and questionnaires; in [18], where the mobility and demographic data of participants were collected through the use of a mobile application; as well as in [19], where geotagged data from a social media app was used. However, the consent of participants is difficult to obtain, especially due to privacy concerns and legislation. This is evidenced by [20], in which participants were asked through letters to run an app for seven days which would log their GPS location, and they were required to input locations where they stayed for over half an hour. That experiment had a 1.33% participation rate (45 individuals consented out of 3380 requests sent out). In addition, due to high reliance on individual users’ compliance in self-reporting, data obtained through active methods may have some intrinsic bias and is likely to be scarcer than desired. Thus, for this study, we have chosen to use a passive tracking method.

Other than active tracking methods, there are also methods to passively track human mobility, such as cellular data tracking using cell towers [5]–[8], the use of cameras [9]–[11], and passive sensing of mobile phone signals such as Bluetooth and Wi-Fi.

While cell towers are readily present in the infrastructure of the country and the user base for collection of data is large, the data collected is difficult to access as it requires going through the service providers. Cell tower data also has a granularity in the range of kilometers, which is too coarse to examine the micro-mobility of humans. As for camera-based tracking, it is mainly limited to sparse crowds [9] and a small coverage area [10]. Compounded with the inability to estimate the current distribution of people flowing into and out of each segment of the video [11], it is less feasible for use in our current study, which aims to study the micro-mobility of people within a residential estate and the nearby set of buildings, which would theoretically require a considerably large number of cameras to cover the entire area.

We then turn to passive sensing of mobile phone signals. When making a decision between the use of Wi-Fi signals as compared to Bluetooth, we consider that it is more likely for a given device to have its Wi-Fi on as compared to its Bluetooth [21]. Sensing of Wi-Fi signals has been used in previous literature and works for finer resolution as compared to cell towers, as detection of packets is limited to the radius of Wi-Fi detection, which is typically less than 100 meters. Additionally, the authors in [22] have shown that it is possible to infer a large proportion of mobility of a population from a time-series of Wi-Fi signals coupled with a small number of GPS data samples. This supports the use of passively collected Wi-Fi signals as a viable method of location detection over time.

In the previous literature, there have been a few works in which the authors used sensors that were mobile, for example mobile phones held by volunteers [23], or laptops held by researchers themselves [24]. The use of these sensors is manpower intensive and therefore becomes less feasible for data collection of daily mobility over a long period of time. For the remaining majority of works that have used sensing of Wi-Fi signals, the sensors used mostly consist of Wi-Fi access points or self-implemented infrastructures, which are stationary over the course of the data collection. This means that they do not require volunteer participation, and thus are ideal for passive people counting and tracking.

Out of these papers that use stationary Wi-Fi sensors for their tracking, some mainly collect data for specific events rather than for daily human movement [25]–[28]; some are directed toward specific industries’ applications such as investigating the movements within a hospital [29], within a university [30], within a shopping mall [31], [32], within a public district with shopping malls and office buildings [33], and among popular tourist locations on a set of islands [14]; and a few cover large enough parts of cities such as Manhattan and Copenhagen to identify the daily movement of commuters between work and home [15], [16], [22]. Although the scopes of these papers vary widely, they have certain similarities in the analysis of the data. Most of them present the counts of detected devices over time, as well as heatmaps superimposed over maps of the region of detection. Some works [14]–[16] also have illustrations of the strength of each direct connection between sensors as a form of analysis of flow between each sensor. However, these previous works all lack the systematic use of machine learning methods to gather deeper insights into their mobility data. Therefore, this paper will propose the use of a machine learning method, clustering, to discover common mobility patterns from the data we have collected.

B. Clustering Methods

Within the field of machine learning, there are two main types: supervised learning, which requires labeled data, and unsupervised learning, which does not. In this case, as the data we have collected lacks ground truth, we will be using an unsupervised machine learning method, more specifically clustering. Clustering methods find groups in the input data based on input parameters and a distance measure.

As our data size is large, we would require the algorithms we use to be scalable. Some algorithms are not scalable, such as Mean Shift clustering [34], which requires multiple nearest neighbor searches during the execution of the algorithm, and Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) [35], which does not scale well to high dimensional data. Thus, these algorithms are less ideal for this study.

Other clustering algorithms such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [36] and Ordering Points to Identify the Clustering Structure (OPTICS) [37] are computationally more complex and require more complicated, potentially iterative parameter selection. Additionally, as we hope to potentially discover unusual mobility patterns, it would be more difficult to tune DBSCAN’s parameters to obtain explainable and meaningful clusters, as compared to k-means.

The k-means clustering algorithm [38], on the other hand, has the advantages of being simple to implement, with a single intuitive choice of parameter (number of clusters, k), as well as being scalable to large datasets. The k-means clustering algorithm has been extensively studied in cases with outliers as well, and it has been shown in works such as those by Im et al. [39] and Bhaskara et al. [40] that even in datasets with noise, a high-quality clustering can be obtained after removal of outliers, while using the k-means algorithm. Other literature on using k-means on a noisy dataset can also be found in the references [41]–[45]. In addition to k-means, hierarchical agglomerative clustering (HAC) [46] is also a possible choice of algorithm, as it is also scalable to large datasets and has an intuitive visualization in terms of the dendrogram. Users are able to easily view the distance between each level of agglomerated clusters and decide on an appropriate distance threshold. Thus, for this paper, we will be using the k-means clustering algorithm as well as the HAC algorithm for the purposes of clustering.

III. DATA COLLECTION AND PREPROCESSING

A. Data Collection

Passive Wi-Fi sensors were deployed at selected locations of interest in two areas of the city: a Facility area and a Residential area. The Facility area consists of a hospital building, an educational institution, and four shopping mall buildings containing offices. For ease of explanation, the shopping mall buildings will be referred to as Malls 1 to 4, respectively. Malls 3 and 4 are located near a transportation hub, which is hypothesised to be the main means of transport to and from the Facility area from other places outside. All of the buildings in the Facility area are connected with an elevated walkway, and our sensors were installed at the entrances to the buildings from the walkways. The Residential area has 38 blocks in total, and our sensors were deployed at 14 selected locations among these buildings. The distance between these 2 areas is in the range of a few hundred metres, and this Facility area would be the nearest commercial center to the Residential area. Sketches of the different buildings in each area can be found in Fig. 8 for a clear understanding.

The sensors passively collect Wi-Fi probe packets sent from nearby mobile devices. Each sensor is a box containing a Wi-Fi sniffer, which is built on top of the Raspberry Pi Model B with additional WI-PI USB dongle for probe request collection. Local processing of probe requests is performed to reduce the volume of uploaded data, in turn reducing costs of data transmission and storage. This local processing involves combining probe requests with a chronological separation of 3 minutes or less [47]. The processed data is then uploaded to the cloud via cellular connection. A flowchart illustrating the above process can be found in Fig. 1.

Fig. 1: Overall framework for understanding spatiotemporal human flow through passive Wi-Fi sensing and mining of collected data.

B. Preprocessing

The collected data has to be further processed using the method in our earlier work [47] to obtain the trajectory of a particular mobile device with a unique MAC address. The trajectories obtained using the above method are sensor level trajectories. For this study, we will do some merging of the sensor level trajectories to obtain building level trajectories, which will be explained later. After the trajectories are obtained, due to the noisy nature of such data collected using Wi-Fi probes, further filtering based on heuristics is performed as described later on in this section.

A sensor level trajectory is represented by $(x_1, x_2, \ldots, x_n)$, in which $x_i = (\text{macAddress}, \text{sensorID}_i, \text{nextSensor}_i, \text{startTime}_i, \text{endTime}_i, \text{stayTime}_i, \text{takeTime}_i)$. As each device’s trajectory is grouped together by MAC address, the value for macAddress for each $x_i$ in the same trajectory will be the same. No other information about devices was retained, which keeps the privacy of device owners from being compromised as we cannot use the MAC address alone to identify a specific individual. $\text{sensorID}_i$ and $\text{nextSensor}_i$ belong to the set of all sensors used in this study, $\{A1, A2, B1, B2, \ldots\}$. The first character of each sensor name refers to each building where the sensor was placed, while the remaining digits serve to differentiate different sensors placed at the same building. For example, A1 and A2 represent two different sensors, both placed at the building labeled ‘A’. In the code, $\text{sensorID}_i$ is a string variable, so the building is represented by the first character of the string, $\text{sensorID}_i[0]$. For example, if $\text{sensorID}_i$ is ‘A1’, $\text{sensorID}_i[0]$ would be ‘A’. $\text{sensorID}_i$ represents the ID of the $i$th sensor in the trajectory, while $\text{nextSensor}_i$ represents the ID of the $(i+1)$th sensor. For the last element in the trajectory $x_n$, $\text{nextSensor}_n$ will be left empty. $\text{startTime}_i$ and $\text{endTime}_i$ represent the start time and end time of detection at sensor $i$ respectively. $\text{stayTime}_i$ (time spent at the current sensor) and $\text{takeTime}_i$ (time taken to reach the next sensor after leaving current sensor) are calculated as in Eqn. (1) and (2) respectively:

\[ \text{stayTime}_i = \text{endTime}_i - \text{startTime}_i \tag{1} \]

\[ \text{takeTime}_i = \begin{cases} \text{startTime}_{i+1} - \text{endTime}_i & \text{if } i = 1, \ldots, n-1 \\ 0 & \text{if } i = n \end{cases} \tag{2} \]

Each trajectory lasts for the length of one day. Days were taken from 3:00 AM on one calendar day to 3:00 AM the next, to give allowance for some commercial or social activities that may cut across midnight.

For the purpose of this study, we consider trajectories at building level instead of sensor level, as each individual building has its own purpose (such as shopping mall, hospital etc.) and thus any results we obtain could be explained more intuitively as compared to individual sensor levels. To get the building level trajectories from sensor level trajectories, consecutive entries detected at the same building in the same trajectory were merged if they occurred within 6 hours of each other on the same day according to Algorithm 1. Two sensors are from the same building if the first character in the sensor name is the same, such as A1 and A2.

Algorithm 1 Merging of sensor level trajectory
Input: Traj_input: original sensor level trajectory
Output: Traj_new: merged building level trajectory
  n ← length(Traj_input)
  Traj_new ← []
  for i = 1, ..., n do    ▷ Change sensor names to buildings
    sensorID_i ← sensorID_i[0]
    nextSensor_i ← nextSensor_i[0]
  end for
  for i = 1, ..., n do
    if takeTime_i < 21600 and nextSensor_i = sensorID_i then
      macAddress_new ← macAddress_i
      sensorID_new ← sensorID_i
      nextSensor_new ← nextSensor_{i+1}
      startTime_new ← startTime_i
      endTime_new ← endTime_{i+1}
      stayTime_new ← (stayTime_i + takeTime_i + stayTime_{i+1})
      takeTime_new ← takeTime_{i+1}
      x_new ← (macAddress_new, sensorID_new, nextSensor_new, startTime_new, endTime_new, stayTime_new, takeTime_new)
    else
      x_new ← x_i
    end if
    Traj_new ← [Traj_new, x_new]
  end for

An example of merging a trajectory is shown in Fig. 2. It can be seen that the detections at Buildings A and C were merged, but the detections at Building B were not, due to exceeding the time threshold.

Fig. 2: An example of merging certain consecutive entries along a trajectory.

Other than merging, all trajectories with $\text{endTime}_n - \text{startTime}_1$ ($n$ being the trajectory length) less than 5 minutes were discarded as these short trajectories are not informative and are therefore outside the scope of our investigation. Long trajectories indicating that the device stayed at a single location for more than 16 hours in a single day were similarly discarded as they are deemed to be anomalies. After this filtering, our dataset consisted of more than 5.7 million trajectories from around 1.6 million devices over the period of 5 months.

IV. CLUSTERING BY TIME

This section explores the first aspect of trajectory data clustering, which is clustering by time. This section will be split into two parts: the results of the clustering of calendar days according to each day’s features at each location, as well as some further analysis of highlighted clustering results.

A. Clustering Results

Clustering by time involves the extraction of features into a vector representing each day and clustering those vectors. Each day is segmented into 24 intervals of 1 hour each, starting from 3:00 AM on one day to 3:00 AM the next day. The number of people appearing at each location within that hour was extracted from the data, and this forms a 1-by-24 feature vector for each day. Each vector was normalized using min-max normalization before being subjected to k-means clustering. The aim of this clustering is to investigate the possibility of inferring temporal context such as type of day in terms of weekday as compared to weekend based on the patterns of people count at each location in the Facility area. The k-means algorithm is chosen based on its scalability to large datasets and simplicity in choice of a single parameter.

The k-means algorithm uses an expectation-maximization style procedure to perform clustering as described in Eqn. (3) to (4) below. Firstly, $k$ centroids $C_j$ where $j = 1, \ldots, k$ are initialized within the feature space of the data set. Second, each data point is temporarily assigned with an index equal to the index of the centroid nearest to it. The index of the $i$th data point $x_i$ out of the total number of $N$ data points is referred to as $\text{ind}_i$ ($\text{ind}_i \in \{1, \ldots, k\}$) and is computed using Eqn. (3). The distance metric commonly used is Euclidean distance. Next, the centroids of each cluster are reassigned to the arithmetic mean of all the data points in each cluster as in Eqn. (4), where $n_j$ refers to the number of data points in cluster $j$. The last two steps are then repeated until convergence, which occurs when the calculated means of the clusters do not change in subsequent steps.

\[ \text{ind}_i := \arg\min_{j} \lVert x_i - C_j \rVert \tag{3} \]

\[ C_j := \frac{1}{n_j} \sum_{i:\, \text{ind}_i = j} x_i \tag{4} \]

This is a fast and simple method to cluster data, and it works well for data with features of a similar type - in this case, all features are numbers of people detected at a location within a certain hour. To find the best value of k to set as a parameter for this clustering, we plotted the sum-of-squared error plot as shown in Fig. 3, which indicates that the ideal value for k, located at the elbow point of the curve, is between 3 and 5. After testing out these values for k, a value of k = 4 turned out to give the best balance between specificity and interpretability, as a value of 3 could be broken down further whereas the value of 5 broke down the clusters in an unintuitive way. From the results, it also largely corresponds to 4 main types of days with their own characteristics - working weekdays, Fridays, Saturdays and Sundays.

Fig. 3: Plot of the sum-of-squared error values for different values of k.

Fig. 4: Representation of results of clustering by time. Weeks 1, 2, 16, and 21 are selected for clarity. Weeks 1 and 2 show public holidays (PH) that are grouped into the same cluster as Sundays, while weeks 16 and 21 are selected as an example of ‘normal weeks’ without PH or special days.

Fig. 4 shows a plot of the results of clustering by day in calendar form. In each calendar, the column represents one day of the week from Monday through Sunday, while each row represents the week number of the data, with week 1 being the first week of data collection, and so on.

Weeks 1 and 2 were selected to show the cluster assignment of public holidays (PH) and PH eves. The PH in weeks 1 and 2 are the squares with red borders. Weeks 16 and 21 were chosen to show the cluster assignment of a “normal” week, that is, without special days like PH or PH eves.

From the calendars of Malls 1 to 4 and the Hospital, it can be seen that the clustering generally follows the four types of days below with overall at least 70% of instances falling into each respective cluster: Cluster 1 (dark blue) mainly has working Mondays to Thursdays, Cluster 2 (brown) mainly contains Fridays, Cluster 3 (yellow) mainly contains Saturdays, and Cluster 4 (light blue) mainly contains Sundays. As seen from Table 1(a) to 1(f), 100% of the available PH in the dataset are clustered in the same cluster as Sundays (Cluster 4), while PH eves are clustered in the same cluster as Fridays (Cluster 2) in a proportion of 50% and above for five out of the six buildings.

Now that we know that different types of days follow generally different patterns, the actual patterns of each day are then investigated in the next subsection.

B. Further Analysis of Clustering Results

To examine how the actual patterns of people count vary in different clusters, an example of people count curves on different clusters of days from Mall 2 was visualised in Fig. 5. The colors correspond to the cluster assignments in Fig. 4.

[...] between 11:00 AM and 2:00 PM, and between 5:00 PM and 8:00 PM. These periods of time roughly correspond to office hours and meal times, so these could show the surges of people arriving at the mall to have meals within those time periods. The last peak that occurs in the evening could also represent the evening shoppers. However, the first peak of people count is lower for Cluster 2 than Cluster 1, and the last peak is much higher. There could be an increase in people going to the mall in the evenings on Friday/PH eve to shop as compared to working Mondays to Thursdays.

For Clusters 3 and 4, the curves are generally smooth, increasing sharply in the late morning and plateauing across the middle of the day before dropping back down to zero after 10:00 PM. Cluster 3’s curve gently increases from 1:00 PM to 7:00 PM, reaching its highest point at 7:00 PM, while Cluster 4’s curve lacks a noticeable increase in the evening. This could be because people are more likely to stay out later in the evenings on Saturdays as they do not need to go to work the following morning, compared to Sundays when they do.

V. CLUSTERING BY PERSON

This section explores the second aspect of trajectory data clustering, which is clustering by person.
This section will be split into two parts: the results of the clustering of individual trajectories, and the temporal analysis of highlighted clustering results based on the analysis of the above section. A. Clustering Results Individual trajectories within a single day had their features extracted into a single vector. The features used for clustering were the accumulated time detected in a single day at the hospital, shopping mall, educational institute, the residential estate (day time), and the residential estate (night time) re- spectively. The detected time at the residential estate was split into between 7:00 AM to 7:00 PM (day time) and 7:00 PM to 7:00 AM (night time). This split was performed as many retired elderly were observed to be staying at the estate and they are hypothesized to be more active in the day time, as opposed to the working adults or school children, who are likely to be more active in the residential estate later in the day or evening. These features were then used as input for clustering via the k-means clustering algorithm. A value for k has to be predetermined for use in k- means. Traditional indices for calculating the optimal number of clusters for k-means clustering, such as the silhouette index Fig. 5: Curves of device count versus time at Mall 2 on [48] and sum-of-squared error plot, suggests an ideal value of different days, separated based on the clustering results. The 5. However, when trying to use this value, the sizes of the four clusters are mainly corresponding to 1) Working Mondays clusters are greatly skewed. We therefore manually increase k to Thursdays, 2) Fridays/Public holiday eves, 3) Saturdays, to find a suitable value for clustering and end up with a value and 4) Sundays/PH. The x-axis of each plot represents the of 8. When the value of k was originally selected as 5, the time of day while y-axis represents normalized device count. 
people who stayed for a longer time at the shopping malls Upon first glance, the curves for Clusters 1 (Mondays to were grouped together to form a very large cluster. After the Thursdays) and 2 (Fridays and PH Eves) are very similar to value of k was increased to 8, this large cluster was further each other and very different from Clusters 3 (Saturdays) and divided into 3 smaller clusters, each representing one more 4 (Sundays and PH), which are also similar to each other. specific type of person. Similarly, the people who stayed the For Clusters 1 and 2, they each have three peaks of people longest time at the hospital were originally grouped together count, which tend to occur between 6:00 AM to 9:00 AM, in a large cluster in the k = 5 case and this large cluster was 7 TABLE I: Percentage of actual (row) vs clustered (column) days in each separate building (a) Hospital (b) Institute Mon-Thur Fri/PH Eve Sat Sun/PH Mon-Thur Fri/PH Eve Sat Sun/PH Mon-Thur 51.4 48.6 0.0 0.0 0.0 0.0 Mon-Thur 51.4 42.9 0.0 2.9 2.9 0.0 Fri 16.7 83.3 0.0 0.0 0.0 0.0 Fri 35.0 65.0 0.0 0.0 0.0 0.0 PH Eve 100.0 0.0 0.0 0.0 0.0 0.0 PH Eve 0.0 0.0 100.0 0.0 0.0 0.0 Sat 0.0 0.0 0.0 100.0 0.0 0.0 Sat 11.1 0.0 0.0 72.2 16.7 0.0 Sun 0.0 0.0 0.0 0.0 100.0 0.0 Sun 0.0 0.0 0.0 21.4 78.6 0.0 PH 0.0 0.0 0.0 0.0 0.0 100.0 PH 0.0 0.0 0.0 0.0 0.0 100.0 (c) Mall 1 (d) Mall 2 Mon-Thur Fri/PH Eve Sat Sun/PH Mon-Thur Fri/PH Eve Sat Sun/PH Mon-Thur 71.8 26.9 0.0 0.0 1.3 0.0 Mon-Thur 94.7 4.0 0.0 1.3 0.0 0.0 Fri 4.8 95.2 0.0 0.0 0.0 0.0 Fri 0.0 100.0 0.0 0.0 0.0 0.0 PH Eve 0.0 0.0 100.0 0.0 0.0 0.0 PH Eve 50.0 0.0 50.0 0.0 0.0 0.0 Sat 0.0 0.0 0.0 76.2 23.8 0.0 Sat 0.0 0.0 0.0 100.0 0.0 0.0 Sun 0.0 0.0 0.0 11.1 88.9 0.0 Sun 0.0 0.0 0.0 16.7 83.3 0.0 PH 0.0 0.0 0.0 0.0 0.0 100.0 PH 0.0 0.0 0.0 0.0 0.0 100.0 (e) Mall 3 (f) Mall 4 Mon-Thur Fri/PH Eve Sat Sun/PH Mon-Thur Fri/PH Eve Sat Sun/PH Mon-Thur 79.5 20.5 0.0 0.0 0.0 0.0 Mon-Thur 83.1 16.9 0.0 0.0 0.0 0.0 Fri 9.5 90.5 0.0 0.0 0.0 0.0 Fri 5.3 94.7 0.0 0.0 0.0 
0.0 PH Eve 50.0 0.0 50.0 0.0 0.0 0.0 PH Eve 50.0 0.0 50.0 0.0 0.0 0.0 Sat 0.0 0.0 0.0 95.2 4.8 0.0 Sat 0.0 0.0 0.0 90.0 10.0 0.0 Sun 0.0 0.0 0.0 0.0 100.0 0.0 Sun 0.0 0.0 0.0 17.6 82.4 0.0 PH 0.0 0.0 0.0 0.0 0.0 100.0 PH 0.0 0.0 0.0 0.0 0.0 100.0 divided into 2 smaller clusters after k was increased to 8. As (blue). Each cluster will be discussed further in detail. a result, k = 8 is selected over 5 because it produces a more CP1 to CP4 are the clusters in which most, if not all, balanced and informative clustering result. This value was used trajectories have visited a shopping mall at least once. The to perform k-means clustering on the feature vectors. The aim amount of time stayed at a shopping mall per trajectory of this portion is to search for clusters of device trajectories increases in order from CP1 (less than 1 hour) to CP4 (7.5 and, from there, infer insights about a given device based on to 15 hours). The transition probability diagrams show heavy the cluster that its trajectory is assigned to. The results of the emphasis on the buildings in the Facility area (top right). The clustering and each part of our proposed analysis is shown most common transitions are those between Malls 2 and 3, in Fig. 6. Each cluster is labeled as ‘CP’, which stands for followed by the transitions between Mall 3 and the Hospital, Cluster of People, together with its corresponding number. as well as between Malls 2 and 4. All four of these clusters Row (a) in Fig. 6 contains the visualizations of the tra- also show a similar pattern in terms of number of nodes visited jectories of the individual clusters. The colors represent the per trajectory peaking at two, however in CP4 there is a much different types of buildings as stated in the legend. Row (b) higher probability of a trajectory containing a single building contains the plots of transition probabilities between each pair as compared to the other three clusters. 
The main difference of nodes within each cluster, in order to study the movement in these clusters lie in the distributions of start and end times. patterns. There are 20 nodes in total, with 6 from the Facility CP1 has three peaks for the start time (8:00 AM, 12:00 PM, area and 14 individual buildings from the Residential area, 6:00 PM), while CP2 and CP3 have two peaks (12:00 PM labeled alphabetically from ’a’ to ’n’. Row (c) contains the and 6:00 PM for CP2, 8:00 AM and 12:00 PM for CP3), histograms showing the distribution of number of unique and CP4 has only one (8:00 AM). These common timings nodes visited per trajectory in each cluster. Finally, Row (d) correspond with the common times to start work (8:00 AM), contains the probability distributions of start and end times of break for lunch (12:00 PM), and leave work for home (6:00 trajectories in each cluster. PM). Those people who appear in the shopping mall at 6:00 From row (a), these eight clusters can be broken down PM and stay there for at least an hour (CP2) could be going for visually into 3 groups of similar clusters - CP1 to CP4 being short shopping trips after work. Those who appear at 8:00 AM mainly at the shopping malls (gray), CP5 and CP6 being may either pass by the shopping malls on their way to work mainly at the hospital (purple), CP7 being mainly at the (CP1) or start working at the shopping mall in the morning Institute (pink), and CP8 being mainly at the Residential area (CP3 and CP4). However, since the time spent ranges from 8 Fig. 6: Illustrations of results from clustering data by person. (a) A visual representation of the eight clusters obtained using k-means clustering of trajectory stay times at differing locations. The horizontal axis represents the time of day from 3:00 AM to 3:00 AM the next day, while the color represents the location where the device was detected, as described in the legend. Trajectories are sorted by the time at which they were first detected. 
(b) Visualizations of transition probability between each pair of nodes within each cluster, for comparing the movement patterns between clusters. A darker and thicker line connecting a pair of nodes means that there are more transitions between those nodes. (c) Histograms describing the number of unique locations visited per trajectory within each cluster. The y-axis has been subjected to probability normalization. (d) Probability distributions of start times and end times of trajectories in each cluster. The x-axis represents the time of day and the y-axis represents the probability that a trajectory in the respective cluster starts or ends within that timing with a resolution of one hour. between 3.5 and 7.5 hours (CP3) as compared to between 7.5 than CP5. This indicates that the people in CP6 start and end and 15 hours (CP4), it can be inferred that trajectories in CP4 their trajectories over a much narrower time period than CP5, are more likely to belong to people working in the shopping which is in line with the consideration that people in CP6 may malls for long hours, while those in CP3 could belong to either be workers at the Hospital, while people in CP5 may be either people with shorter shifts, or long shopping trips. shift workers or visitors to the Hospital. CP7 is the cluster that contains trajectories with the longest CP5 and CP6 are the clusters in which all the trajectories time at the Institute, from 3.5 to 14.5 hours. Its corresponding have visited the Hospital and stayed for a relatively long time transition probability diagram is noticeably different from the (2 to 6.5 hours for CP5 and 7.5 to 15 hours for CP6). The others, with strong links between the Hospital and the Institute, transition probability diagrams both show strong links between as well as between Mall 3 and the Hospital. These strong links Mall 3 and Hospital, as well as Mall 4 and Hospital. 
Since may also be explained as above, due to the proximity of Mall Malls 3 and 4 are connected to a transport hub, these high 3 to the transport hub. Most of the trajectories in this cluster number of transitions could reflect the people taking public contained three unique nodes, which is different from the rest transport to and from the Facility area. The distribution of the as well, since the rest of the unique node distributions all number of nodes visited per trajectory is also similar, both peaked at two. The start and end time distributions are also peaking at two places and decreasing with increasing number unique, in that there is a single prominent sharp spike in each of places up till five or six. The main difference between these of the start and end times. The start time peak occurs at 8:00 two clusters lies in the distributions of the start and end times. AM with a probability value of almost 0.5, while the end time Both clusters have peaks in start time at 8:00 AM and 12:00 peak occurs at 6:00 PM with a value of between 0.3 to 0.4. PM, however the 8:00 AM peak for CP6 is much higher and This indicates that CP7 is likely to represent people who work the 12:00 PM peak is much lower as compared to CP5. For or study at the Institute. the ending times, both clusters have peaks at 5:00-6:00 PM and 9:00 PM, but the 6:00 PM peak is much higher for CP6 Lastly, CP8 is the cluster that contains trajectories with the 9 longest time in the Residential area (0.5 to 11 hours in the day 12:00 PM, and 5:00-6:00 PM, and one small peak at 9:00 PM. and 12 hours at night). Its corresponding transition probability However, the 12:00 PM peak for CP5 is the highest peak, while diagram has visually stronger links are between Residential the 5:00-6:00 PM peak is the highest for CP6. buildings instead of the Facility buildings, as compared to CP1 For CP5, the Saturday line peaks in the morning and at to CP7, which have very few links to the Residential buildings. 
12, before decreasing gently throughout the rest of the day to There are also a few links between Residential and Facility nearly zero at 11:00 PM. In contrast, the Sunday/PH line for area, but these are considerably much less than the other CP5 lacks distinctive peaks, instead gently curving upwards clusters as well. The probability of each number of unique and then decreasing in a similar way to the Saturday line after locations per trajectory appears to decrease exponentially with 4:00 PM. On the other hand, the Saturday and Sunday/PH the number of places per trajectory increasing from two places lines for CP6 are relatively similar and both have four peaks of up till ten places, as compared to other clusters which appear approximately equal height spread throughout the day. These to be more of a parabola shape. One thing to note for the start peaks occur at 6:00 AM, 12:00 PM, 4:00 PM, and 9:00 PM. and end time distribution plot is that the probability values at The difference in the weekend behavior for these CPs, in 3:00 AM and 2:00 AM are much higher than those of other addition to the length of time spent at the Hospital for each clusters. These may reflect the trajectories of residents who cluster, supports the line of thinking that CP5 is more likely to stay at the Residential area overnight, past the cutoff time of represent visitors to the Hospital who go shopping afterwards, 3:00 AM when the day changes. while CP6 is likely to represents the people who are employed at the Hospital. B. Temporal Analysis of By-Person Clustering Results As part of a deeper analysis of the results of clustering trajectory data by person, we explored the temporal aspect of the clustered data by extracting the temporal data of each CP, similar to the extraction in Section III. We then manually grouped the vectors based on the 4 types of days, namely working Mondays to Thursdays, Fridays/PH eve, Saturdays, and Sundays/PH. 
We then plotted the average of each group, as well as the minimum and maximum boundaries, which is shown by the shaded area surrounding each line. Below, we highlight two results of our temporal analysis. Fig. 8: Daily average number of detected devices from CP8 recorded at the Residential area. Selected lines show com- parisons between (a) Mon-Thur and Saturday, (b) Mon-Thur and Sunday/PH. The shaded area represents the maximum and minimum bounds for each type of day. Fig. 8 depicts the results from CP8, the cluster of people recorded as having the longest stay time at the Residential area, as detected in the Residential area. Due to the length of stay at the Residential area, it is likely that they are residents of that area or have that area as a main destination for the day. Since the numbers of detected devices are very similar overall for Fig. 7: Daily average number of detected devices from (a) CP5 the four different types of days, two types of days were chosen and (b) CP6 recorded at the Shopping Malls. The shaded area in each subfigure to do a clear comparison. Fig. 7(a) shows represents the maximum and minimum bounds for each type the comparison between working Mondays to Thursdays and of day. Saturdays, while Fig. 7(b) shows the comparison between working Mondays to Thursdays and Sundays/PH. Fig. 7 depicts the results from CP5 and CP6, the top two For Fig. 7(a), it can be seen that more people are detected in clusters of people recorded as having the longest stay time the evenings on Saturdays as compared to working Mondays at the Hospital, as detected at the Shopping Malls. From the to Thursdays. This can reflect that more people from the proximity of locations, it can be inferred that most of these Residential area stay out late on Saturday nights as compared detections would be of people traveling through the Shopping to working weeknights. 
This makes sense if people tend to go Malls towards the Hospital, as the extremely long stay time home early when there is work the next day, as compared to at the Hospital for this cluster’s trajectories indicate a high Saturdays when there is a much lower probability of people likelihood of the trajectories having the Hospital as their main going to work on Sundays/PH. destination for the day. Upon first glance, it can be seen that for both cases, the Mon-Thur and Friday/PH Eve lines are Fig. 7(b) shows that there are more devices detected before very close to each other, and much higher than the Saturday 10:00 AM on working Mondays to Thursdays as compared and Sunday/PH lines. The Mon-Thur and Friday/PH Eve lines to on Sundays/PH, but there are more devices detected after in both cases have three large peaks as well at 7:00-8:00 AM, 10:00 AM on Sundays/PH than on working Mondays to Thurs- 10 days. This could indicate that the residents generally wake up are defined as the squared Euclidean distance between points. or leave the house later in the mornings on Sundays/PH than We hypothesize that the buildings will form roughly spherical on working Mondays to Thursdays. clusters based on the map, and Ward’s linkage is suitable for this. VI. CLUSTERING BY LOCATION X X WCV = jjx  jj (6) This section addresses the final aspect of clustering ad- i j j i:ind =j dressed in this study, which is clustering by location. The i first part of this section presents the results of clustering The above process was done separately for three different of the transition probability matrices illustrating transition time periods of a day for the whole dataset. The first selected probabilities between pairs of buildings, while the second time period was between 6:00 AM to 10:00 AM, which is part provides a further examination into the transition patterns the time when people generally have breakfast, leave their extracted from the data. houses, or arrive at their workplace. 
The second selected time period was between 11:00 AM to 2:00 PM, which is the time A. Clustering Results when people generally have lunch, and thus there could be a After clustering the data by temporal patterns as well as by more prominent movement around the malls. Lastly, the third selected time period was between 6:00 PM and 10:00 PM, individual patterns, the third aspect of trajectory data clustering which is generally the time when people working office hours is to look at the spatial patterns. The number of detected tran- leave work, have dinner, or return home. The results of HAC sitions between each pair of nodes is extracted and compiled by location for each time period are shown in Fig. 9. From into a transition probability matrix, where each row denotes the a general overview, it can be easily seen that the clustering probability of people moving out of the corresponding source results differ for each time of the day. node, and each column denotes the probability of people moving towards the corresponding destination node. For this In Fig. 8(a), the dendrograms show that the groupings of matrix, we considered 20 locations - 6 from the Facility area similar buildings change with different timings of the day. as well as 14 individual buildings from the Residential area. The first six labels of each dendrogram represent the Malls, The transition probability for each entry of the matrix were Institute and Hospital. They are grouped into pairs in the calculated using the below equation: morning and afternoon, while they are grouped in threes in the evening. The yellow and purple pairs also change members N(i; j) T (i; j) = (5) between morning and afternoon. For a better illustration of N(i; k) k2[1;20];k6=i the groupings with respect to the building map, a simplified version is shown in Fig. 8(c). The remaining 14 labels of each where T (i; j) refers to the entry of the transition probability dendrogram represent the buildings in the Residential area. 
matrix in row i and column j for i 6= j, and N(i; j) refers to These also differ throughout the day, and an illustration can the number of transitions observed in the data moving from be seen in Fig. 8(d). The buildings that were from a different node i to node j for i 6= j. The diagonal entries of the matrix, grouping in the previous time period have bolded outlines. The indicating probability of each node going back to itself, were differences between the groupings of the Residential area are then set to 1 to fill up the matrix. As the entries in the input described in the following paragraph. matrix are distances rather than coordinates in a feature space, the use of k-means clustering is less suitable. Thus, in this There is one cluster in the Residential area that stays the case, the transition probability matrix was then subject to HAC same throughout all three time periods, represented by the as described in [46]. orange cluster. The buildings that this cluster corresponds to HAC is an algorithm designed to cluster data points together are located on one end of the Residential area and thus they based on a given distance matrix. Many different types of may have similar transition probabilities that differ from the rest of the Residential area. The pink cluster stays the same linkages can be used such as average linkage, single linkage, through the morning and afternoon, but has an additional Ward linkage, and so on. The result of HAC can be visualized member in the evening that was originally part of the green in the form of a dendrogram, which has all the nodes at the cluster. The red cluster present in the morning was grouped bottom as separate ’leaves’, which are joined together pairwise with the dark blue cluster for the rest of the day. by ’branches’, until all the clusters are joined together at the very top. The rows and columns of the transition probability In Fig. 
8(b), the transition probability matrices offer more matrix would then be rearranged simultaneously according insight on the general flow of people around the area. One to the output dendrogram. In this case, Ward’s method [49] prominent observation is that although there is a visible prob- is used to calculate the linkages between different clusters ability that people from the Residential area move towards the and data points at each level. Ward’s method is an objective Facility area, there is a very low probability of them moving function approach involving the pairing of clusters at each in the opposite direction. A possible reason for this is that step that results in the minimum increase in the total within- the people coming from the Residential area only contribute cluster variance after merging. The total within-cluster vari- to a small percentage of the overall number of visitors to the ance (WCV) is shown in Eqn. 6, where x represents data point Facility area, and as a result there is a much larger number i, j represents the cluster number, and  represents the mean of transitions between the Facility buildings as compared to of all the points within cluster j. The initial cluster distances the number of transitions moving from the Facility buildings 11 Fig. 9: Diagrams showing different clustering arrangements of locations over the course of the day. (a) Dendrograms produced from HAC, using the transition probability matrices in (b) as input. (c) Corresponding map of locations in the Facility area. (d) Corresponding map of locations in Residential area. Buildings in grey do not have sensors installed. Buildings that changed from a different grouping in the previous time period have bolded outlines, e.g. afternoon different from morning. 12 toward the Residential buildings. This is largely consistent arrows. Subsequently, the number of dominant flows towards throughout the three time periods. 
As this large discrepancy Mall 1 decreases in the afternoon, and finally, in the evening, makes it difficult to directly identify dominant directions from all the dominant directions are flowing outwards from Mall 1, Fig. 8(b), a clearer visualization of dominant directions will as shown by the blue arrows. be provided later in Fig. 10. Next, Fig. 10(b) shows an example of continuous flow Another observation is that the direction of the largest towards a specific building throughout the day. The building of transition probabilities, represented by the darkest squares, interest in this figure is Building j, in the Residential area. At change over time in the Facility area. In the mornings, the all three time periods, the dominant outward flow (indicated main direction is from Malls 3 and 4 toward the Hospital, as by the blue arrow) is always towards its neighboring building, well as from Mall 2 to Mall 4. For the afternoon, the transition Building k, in the Residential area. probabilities between Malls 2 and 3 are observably higher Lastly, Fig. 10(c) shows an example of continuous flow in a than those in the morning, and they are higher still in the general direction throughout the day. The building of interest evening. There is also a larger transition probability in both is Building h, another building in the Residential area. Its directions between the Hospital and Mall 3, as well as a higher inflows, shown by red arrows, tend to come from the left and probability of transition from the Institute to the Hospital. For lower parts on the map, while the outflows, shown by blue the evening, two of the main directions are reversed from the arrows, tend to go towards the top and right. This pattern morning, coming from Mall 4 to Mall 2, and from Hospital appears in all different time periods of the day as well. to Mall 3. This could mean that there is an outflow from the transportation hub to the rest of the Facility area in the VII. 
morning, while the flow is opposite in the evening. This may in turn indicate that the bulk of these detected transitions come from people who work within the Facility area during office hours.

As for the Residential area, it can be seen from Fig. 8(b) that there is a noticeably high probability of arriving at building b from several other buildings since the column of the transition probability matrices corresponding to building b has mostly darker squares than the rest. This trend shows minimal change throughout the time periods. However, there is a lower probability of transitions from the Residential buildings to the Facility buildings in the evening period as compared to the morning and afternoon periods.

B. Analysis of Transition Patterns

In order to have a deeper analysis of the spatial patterns, the dominant transition directions between each pair of buildings were plotted in Fig. 10. A direction of transition between a pair of buildings is considered dominant if it occurs with a probability higher than 0.55. This probability is calculated by taking the number of transitions in one direction and dividing by the sum of the number of transitions in both directions.

At first glance of Fig. 10, one observation that stands out is that in the morning, a large block of the dominant directions start from the Residential area and lead to the Facility area, while it is the reverse case in the evening. This agrees with a general understanding that humans will go out to work in the morning and return home in the evening.

The rest of the buildings’ flows are analyzed through plotting the dominant directions on maps, focusing on the inflow and outflow of one building at a time, called the building of interest. Patterns can be more easily identified visually, and three such patterns are highlighted in Fig. 11. These include reversal of flow at different times of day, continuous flow towards a specific building, and continuous flow in a general direction. Firstly, in Fig. 10(a), the building of interest is Mall 1, located in the Facility area. In the morning, the dominant flows are all inwards towards Mall 1, as shown by the red

Fig. 10: A plot of dominant directions with probability larger than 0.55. Dominant directions are shown in black.

Fig. 11: Highlighted spatial results of analysis of dominant flows over different time periods. (a) Reversal of flows. (b) Continuous outward flow to one specific building. (c) Continuous flow in a general direction.

CONCLUSION

In this paper, we proposed a systematic approach to analyze trajectory data obtained through passive Wi-Fi sensing. We used two unsupervised machine learning techniques, k-means clustering and HAC, to examine three different aspects of clustering of trajectory data, namely by time, by person, and by location. In doing so, we observed patterns of daily movement such as the fluctuation of people count over the course of the day, clusters of trajectories belonging to different types of people, as well as the relative volumes of flow between different buildings. We also provided various ways of visualization for the clustering results.

For future work, the proposed approach can be performed on datasets gathered from different locations, such as for different residential estates and facilities, and a comparison can be done to identify differences such as the difference in trajectory patterns of people living in mature estates as compared to newer estates, or estates with different proximities to different sets of facilities.

The findings could help in future work related to urban planning in the following ways. Clustering by time gives insights on the daily footfall in various buildings on each type of day, therefore event planning could be more informed, or building tenant proportions could be adjusted if deemed to have an effect on the footfall. Clustering by person gives an idea of rough proportions of travel patterns of users living in an estate, and facilities within the estate can also be adjusted based on need. Finally, clustering by location can help in the planning of the land use in upcoming estates, depending on how the users flow from building to building.

ACKNOWLEDGMENT

This research is supported by the Singapore Ministry of National Development and the National Research Foundation, Prime Minister’s Office under the Land and Liveability National Innovation Challenge (L2 NIC) Research Programme (L2 NIC Award No. L2NICTDF1-2017-4). Any opinion, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the Singapore Ministry of National Development and National Research Foundation, Prime Minister’s Office, Singapore.

REFERENCES

[1] Y. Zheng, Y. Liu, J. Yuan, and X. Xie, “Urban computing with taxicabs,” in Proceedings of the 13th International Conference on Ubiquitous Computing. ACM, 2011, pp. 89–98.
[2] R. A. Becker, R. Caceres, K. Hanson, J. M. Loh, S. Urbanek, A. Varshavsky, and C. Volinsky, “A tale of one city: Using cellular network data for urban planning,” IEEE Pervasive Computing, vol. 10, no. 4, pp. 18–26, 2011.
[3] Y. Mowafi, A. Zmily, I. Dhiah el Diehn, and D. Abu-Saymeh, “Tracking human mobility at mass gathering events using WISP,” in Second International Conference on Future Generation Communication Technologies (FGCT 2013). IEEE, 2013, pp. 157–162.
[4] D. Zhang, L. Guo, L. Nie, J. Shao, S. Wu, and H. T. Shen, “Targeted advertising in public transportation systems with quantitative evaluation,” ACM Transactions on Information Systems (TOIS), vol. 35, no. 3, p. 20,
[5] M. C. Gonzalez, C. A. Hidalgo, and A.-L. Barabasi, “Understanding individual human mobility patterns,” Nature, vol. 453, no. 7196, p. 779,
[6] R. Becker, R. Cáceres, K. Hanson, S. Isaacman, J. M. Loh, M. Martonosi, J. Rowland, S. Urbanek, A. Varshavsky, and C. Volinsky, “Human mobility characterization from cellular network data,” Communications of the ACM, vol. 56, no. 1, pp. 74–82, 2013.
[7] S. Jiang, J. Ferreira, and M. C. Gonzalez, “Activity-based human mobility patterns inferred from mobile phone data: A case study of Singapore,” IEEE Transactions on Big Data, vol. 3, no. 2, pp. 208–219,
[8] M. Zhang, H. Fu, Y. Li, and S. Chen, “Understanding urban dynamics from massive mobile traffic data,” IEEE Transactions on Big Data, vol. 5, no. 2, pp. 266–278, 2017.
[9] Honglian Ma, Huchuan Lu, and Mingxiu Zhang, “A real-time effective system for tracking passing people using a single camera,” in 2008 7th World Congress on Intelligent Control and Automation, 2008, pp. 6173–
[10] R. Eshel and Y. Moses, “Tracking in a dense crowd using multiple cameras,” International Journal of Computer Vision, vol. 88, no. 1, pp. 129–143, 2010.
[11] A. V. Kurilkin, O. O. Vyatkina, S. A. Mityagin, and S. V. Ivanov, “Evaluation of urban mobility using surveillance cameras,” Procedia Computer Science, vol. 66, pp. 364–371, 2015.
[12] A. Galati and C. Greenhalgh, “Human mobility in shopping mall environments,” in Proceedings of the Second International Workshop on Mobile Opportunistic Networking, 2010, pp. 1–7.
[13] Y. Yoshimura, S. Sobolevsky, C. Ratti, F. Girardin, J. P. Carrascal, J. Blat, and R. Sinatra, “An analysis of visitors’ behavior in the louvre museum: A study using bluetooth data,” Environment and Planning B: Planning and Design, vol. 41, no. 6, pp. 1113–1131, 2014.
[14] N. Nunes, M. Ribeiro, C. Prandi, and V. Nisi, “Beanstalk: a community based passive wi-fi tracking system for analysing tourism dynamics,” in Proceedings of the ACM SIGCHI Symposium on Engineering Interactive Computing Systems. ACM, 2017, pp. 93–98.
[15] M. Traunmueller, N. Johnson, A. Malik, and C. E. Kontokosta, “Digital Traces: Modeling Urban Mobility using Wifi Probe Data,” in 6th International Workshop on Urban Computing, ACM KDD, 2017.
[16] M. W. Traunmueller, N. Johnson, A. Malik, and C. E. Kontokosta, “Digital footprints: Using WiFi probe and locational data to analyze human mobility trajectories in cities,” Computers, Environment and Urban Systems, vol. 72, pp. 4–12, 2018.
[17] S. Van der Spek, J. Van Schaick, P. De Bois, and R. De Haan, “Sensing human activity: GPS tracking,” Sensors, vol. 9, no. 4, pp. 3033–3055, 2009.
[18] S. H. Marakkalage, S. Sarica, B. P. L. Lau, S. K. Viswanath, T. Balasubramaniam, C. Yuen, B. Yuen, J. Luo, and R. Nayak, “Understanding the lifestyle of older population: Mobile crowdsensing approach,” IEEE Transactions on Computational Social Systems, vol. 6, no. 1, pp. 82–95,
[19] T. Hu, E. Bigelow, J. Luo, and H. Kautz, “Tales of two cities: Using social media to understand idiosyncratic lifestyles in distinctive metropolitan areas,” IEEE Transactions on Big Data, vol. 3, no. 1, pp. 55–66, 2016.
[20] N. M. Yip, R. Forrest, and S. Xian, “Exploring segregation and mobilities: Application of an activity tracking app on mobile phone,” Cities, vol. 59, pp. 156–163, 2016.
[21] L. Schauer, M. Werner, and P. Marcus, “Estimating crowd densities and pedestrian flows using wi-fi and bluetooth,” in Proceedings of the 11th International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, 2014, pp. 171–177.
[22] P. Sapiezynski, A. Stopczynski, R. Gatej, and S. Lehmann, “Tracking human mobility using wifi signals,” PloS one, vol. 10, no. 7, 2015.
[23] Y. Chon, S. Kim, S. Lee, D. Kim, Y. Kim, and H. Cha, “Sensing WiFi packets in the air: practicality and implications in urban mobility monitoring,” in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2014, pp. 189–200.
[24] M. V. Barbera, A. Epasto, A. Mei, V. C. Perta, and J. Stefa, “Signals from the crowd: uncovering social relationships through smartphone probes,” in Proceedings of the 2013 Internet Measurement Conference, 2013, pp. 265–276.
[25] A. Basalamah, “Crowd mobility analysis using wifi sniffers,” Int J Adv Comput Sci Appl, vol. 7, pp. 374–378, 2016.
[26] J. McAuley, C. Roux, and J. Little, “Towards Approaches and Techniques for Analysing WiFi Location Data,” The 25th Irish Conference on Artificial Intelligence and Cognitive Science, Dublin, 2017.
[27] A. Alessandrini, C. Gioia, F. Sermi, I. Sofos, D. Tarchi, and M. Vespe, “Wifi positioning and Big Data to monitor flows of people on a wide scale,” in 2017 European Navigation Conference (ENC). IEEE, 2017, pp. 322–328.
[28] Y. Zhou, B. P. L. Lau, Z. Koh, C. Yuen, and B. K. K. Ng, “Understanding Crowd Behaviors in a Social Event by Passive WiFi Sensing and Data Mining,” IEEE Internet of Things Journal, 2020.
[29] A. J. Ruiz-Ruiz, H. Blunck, T. S. Prentow, A. Stisen, and M. B. Kjærgaard, “Analysis methods for extracting knowledge from large-scale wifi monitoring to inform building facility planning,” in 2014 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 2014, pp. 130–138.
[30] E. Kalogianni, R. Sileryte, M. Lam, K. Zhou, M. Van der Ham, S. Van der Spek, and E. Verbree, “Passive wifi monitoring of the rhythm of the campus,” in Proceedings of The 18th AGILE International Conference on Geographic Information Science, 2015, pp. 9–14.
[31] J. Shen, J. Cao, X. Liu, and S. Tang, “SNOW: Detecting shopping groups using wifi,” IEEE Internet of Things Journal, vol. 5, no. 5, pp. 3908–3917, 2018.
[32] J. Shen, J. Cao, and X. Liu, “BaG: Behavior-aware group detection in crowded urban spaces using wifi probes,” IEEE Transactions on Mobile Computing, 2020.
[33] Y. Zhou, B. P. L. Lau, C. Yuen, B. Tunçer, and E. Wilhelm, “Understanding urban human mobility through crowdsensed data,” IEEE Communications Magazine, vol. 56, no. 11, pp. 52–59, 2018.
[34] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603–619, 2002.
[35] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: an efficient data clustering method for very large databases,” ACM SIGMOD Record, vol. 25, no. 2, pp. 103–114, 1996.
[36] M. Ester, H.-P. Kriegel, J. Sander, X. Xu et al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD, vol. 96, no. 34, 1996, pp. 226–231.
[37] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “Optics: Ordering points to identify the clustering structure,” ACM Sigmod record, vol. 28, no. 2, pp. 49–60, 1999.
[38] S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
[39] S. Im, M. M. Qaem, B. Moseley, X. Sun, and R. Zhou, “Fast noise removal for k-means clustering,” arXiv preprint arXiv:2003.02433,
[40] A. Bhaskara, S. Vadgama, and H. Xu, “Greedy sampling for approximate clustering in the presence of outliers,” in Advances in Neural Information Processing Systems, 2019, pp. 11 148–11 157.
[41] W. Tang and T. M. Khoshgoftaar, “Noise identification with the k-means algorithm,” in 16th IEEE International Conference on Tools with Artificial Intelligence. IEEE, 2004, pp. 373–378.
[42] S. Ben-David and N. Haghtalab, “Clustering in the presence of background noise,” in International Conference on Machine Learning, 2014, pp. 280–288.
[43] W. S. Manjoro, M. Dhakar, and B. K. Chaurasia, “Operational analysis of k-medoids and k-means algorithms on noisy data,” in 2016 International conference on communication and signal processing (ICCSP). IEEE, 2016, pp. 1500–1505.
[44] B. Schelling and C. Plant, “KMN – Removing noise from k-means clustering results,” in International Conference on Big Data Analytics and Knowledge Discovery. Springer, 2018, pp. 137–151.
[45] Z. He and C. Yu, “Clustering stability-based evolutionary k-means,” Soft Computing, vol. 23, no. 1, pp. 305–321, 2019.
[46] D. Müllner, “Modern hierarchical, agglomerative clustering algorithms,” arXiv preprint arXiv:1109.2378, 2011.
[47] K. Li, C. Yuen, S. S. Kanhere, K. Hu, W. Zhang, F. Jiang, and X. Liu, “An Experimental Study for Tracking Crowd in Smart Cities,” IEEE Systems Journal, 2018.
[48] P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[49] J. H. Ward Jr, “Hierarchical grouping to optimize an objective function,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 236–244, 1963.

Zann Koh received the B.Eng degree in Engineering and Product Development from the Singapore University of Technology and Design, Singapore, in 2017. She is currently pursuing the Ph.D. degree with the Singapore University of Technology and Design, Singapore, under Dr. Chau Yuen’s supervision. Her current research interests include big data analysis, data discovery, urban human mobility, and unsupervised machine learning.

Yuren Zhou received the B.Eng. degree in Electrical Engineering from Harbin Institute of Technology, Harbin, China in 2014, and the Ph.D. degree from Singapore University of Technology and Design, Singapore in 2019, with a focus on data mining and smart city applications. He is currently a postdoctoral research fellow at Singapore University of Technology and Design. His current research interests include big data analytics and its application in urban human mobility, building energy management, and Internet of Things.

Billy Pik Lik Lau received the degree in computer science and M.Phil. degree in computer science from Curtin University, Perth, WA, Australia, in 2010 and 2014, respectively. He is currently a Ph.D. candidate with Dr. Chau Yuen at the Singapore University of Technology and Design, Singapore. He studied the cooperation rate between agents in multiagent systems during his master's studies. His current research interests include urban science, big data analysis, data knowledge discovery, Internet of Things, and unsupervised machine learning.

Chau Yuen is currently an Associate Professor at Singapore University of Technology and Design. He received the B.Eng. and Ph.D. degrees from Nanyang Technological University, Singapore, in 2000 and 2004, respectively. He was a Postdoctoral Fellow at Lucent Technologies Bell Labs, Murray Hill, NJ, USA, in 2005. He was a Visiting Assistant Professor at The Hong Kong Polytechnic University in 2008. From 2006 to 2010, he was a Senior Research Engineer at the Institute for Infocomm Research (I2R, Singapore), where he was involved in an industrial project on developing an 802.11n Wireless LAN system, and participated actively in 3GPP Long Term Evolution (LTE) and LTE-Advanced (LTE-A) Standardization. He has been with the Singapore University of Technology and Design since 2010. He is a recipient of the Lee Kuan Yew Gold Medal, the Institution of Electrical Engineers Book Prize, the Institute of Engineering of Singapore Gold Medal, the Merck Sharp and Dohme Gold Medal, and twice the recipient of the Hewlett Packard Prize. He received the IEEE Asia-Pacific Outstanding Young Researcher Award in 2012. He serves as an Editor for the IEEE Transactions on Communications and the IEEE Transactions on Vehicular Technology and was awarded the Top Associate Editor from 2009 to 2015.

Bige Tunçer is an associate professor at the Architecture and Sustainable Design Pillar of Singapore University of Technology and Design (SUTD), where she founded the Informed Design Lab. The lab’s research focuses on data collection, information and knowledge modeling and visualization, for informed architectural and urban design. She received her PhD in Architecture from Delft University of Technology (TU Delft), her MSc (computational design) from Carnegie Mellon University, and her BArch from Middle East Technical University. She was an assistant professor at TU Delft, a visiting professor at ETH Zurich, a visiting scholar at MIT, and a visiting professor at Computer Engineering Department of University of Pavia, Italy. Her research interests include evidence based design, big data informed urban design, and design thinking. She leads and participates in various large multi-disciplinary research projects in evidence informed design, IoT, and big data.

Keng Hua Chong is Associate Professor of Architecture and Sustainable Design at the Singapore University of Technology and Design (SUTD), where he directs the Social Urban Research Groupe (SURGe) and co-leads the Opportunity Lab (O-Lab). His research on social architecture particularly in the areas of ageing population, liveable place and data-driven collaborative design has led to several key publications and projects, including Creative Ageing Cities, Second Beginnings, and the New Urban Kampung Research Programme.
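
The dominance rule from Section B (Analysis of Transition Patterns) can be illustrated with a short Python sketch. It is not the authors' implementation: the function name `dominant_directions`, the representation of transitions as (origin, destination) pairs, and the toy building names are assumptions made for illustration. Only the ratio (transitions in one direction divided by transitions in both directions) and the 0.55 threshold come from the text.

```python
from collections import Counter


def dominant_directions(transitions, threshold=0.55):
    """Return {(origin, destination): probability} for dominant directions.

    `transitions` is an iterable of (origin, destination) pairs, one per
    observed transition (a hypothetical input format, not from the paper).
    A direction is dominant when its share of all transitions between the
    two buildings is strictly higher than `threshold`.
    """
    counts = Counter(transitions)
    dominant = {}
    for (origin, destination), n_forward in counts.items():
        if origin == destination:
            continue  # a building has no direction relative to itself
        n_backward = counts.get((destination, origin), 0)
        probability = n_forward / (n_forward + n_backward)
        if probability > threshold:
            dominant[(origin, destination)] = probability
    return dominant


# Toy morning data with made-up building names and counts.
morning = (
    [("Block A", "Mall 1")] * 7
    + [("Mall 1", "Block A")] * 3
    + [("Block A", "Block B")] * 5
    + [("Block B", "Block A")] * 5
)
result = dominant_directions(morning)
print(result)
```

With these toy counts, only the Block A to Mall 1 direction exceeds the threshold (7 of 10 transitions), while the evenly split Block A and Block B pair yields no dominant direction. The sketch applies no minimum-count filter, so a pair observed only once in a single direction would be reported with probability 1.0; whether the paper filters such pairs is not stated.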

Journal

Statistics, arXiv (Cornell University)

Published: Dec 22, 2020
