Clustering - Don't Overthink This

# Clustering

By:: [[Ross Jackson]] 2022-11-10

There is a human tendency to want to group and ungroup things. A child walking into a room in which there is a toy chest will likely pull out all the toys to see what’s there – ungrouping. A child walking into a room in which a bunch of toys are scattered around will likely put some order to the toys – grouping. This tendency plays out again and again throughout life. And so, it makes sense that analysts might want to group too.

Clustering is one method available to analysts for assessing groupings within data. k-means clustering is an approach in which the analyst selects the number of groups, k. If k = 1, there would be one group. If k = 3, there would be three groups. There is no right answer to the number selected for k. Some selections will be potentially informative. Other selections certainly won’t be. An example might help explore how and why this might be the case.

Let’s say that one is analyzing wealth and crime data for a city consisting of 30,000 people. How many groups should an analyst select? As indicated, some selections of the k-value will not be particularly insightful. Selecting k = 1 will result in one large group: the entire city of 30,000 people. We already know that this group is a city, so that grouping doesn’t tell us much. Likewise, selecting k = 30,000 will result in 30,000 groups of one, where each person is their own group. Again, we already know that each person is unique.

So, what would an informative k be? There isn’t a definitive answer here, but there are a couple of approaches available. One could simply start with k = 2 and incrementally increase the k-value from there, examining each result visually to see when the groupings appear to suggest something. This is the try-and-see approach.

Alternatively, one could take a piece of information one knows about what is being analyzed and use that as the basis. In this example, the city might consist of 5 neighborhoods.
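
Before going further with the neighborhood idea, the try-and-see approach can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the resident data are invented stand-ins (a hypothetical wealth score and crime score per person, and only 300 people rather than 30,000), and `kmeans` and `within_cluster_spread` are illustrative helper names, not functions from any library.

```python
import random

def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means (Lloyd's algorithm) on 2-D points; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k data points chosen at random
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k),
                key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2)
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centers.append(centers[j])  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # nothing moved, so the grouping has settled
        centers = new_centers
    return centers, labels

def within_cluster_spread(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum((p[0] - centers[lab][0]) ** 2 + (p[1] - centers[lab][1]) ** 2
               for p, lab in zip(points, labels))

# Hypothetical stand-in data: (wealth score, crime score) per resident, not real figures.
rng = random.Random(42)
made_up_groups = [(20, 80), (50, 40), (85, 15)]
residents = [(rng.gauss(cx, 5), rng.gauss(cy, 5))
             for cx, cy in made_up_groups for _ in range(100)]

# Try-and-see: run k-means for several k-values and compare the results.
spreads = {}
for k in range(1, 7):
    centers, labels = kmeans(residents, k)
    spreads[k] = within_cluster_spread(residents, centers, labels)
    print(f"k={k}: within-cluster spread = {spreads[k]:.0f}")
```

The spread tends to shrink as k grows, so the number by itself cannot choose k. It only hints at where adding more groups stops changing the picture much, which is why the results still need to be looked at and interpreted.
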
One might select k = 5 to see if the data cluster meaningfully into that number of groups. This is the theory-based approach.

How one clusters the data will influence what one sees and what one can articulate. One might want to examine the city (k = 1), neighborhoods (k = 5), social classes (e.g., upper class, middle class, lower class; k = 3), or individuals (k = 30,000). The selection is consequential.

In clustering, one should be clear about what one is attempting to understand. Simply letting the computer optimize the selection of k can be useful, but it leaves significant work to be done. Maybe the computer’s optimized value is k = 2. Understanding why this might be a good way of clustering these data would take context. It might be that one neighborhood is separated from the police station by a train track. The police can respond within 3 minutes to four of the neighborhoods, but depending on train traffic, it can take 5-15 minutes to respond to the one on “the other side of the tracks.” Clustering in this way could highlight that the city would benefit from building an overpass.

Clustering reveals that analysts are called to be creative, theory-infused technicians. Each of these qualities contributes to gaining insight through analysis. Analysis is less about determining the right answer and more about informing collective understandings.

#### Related Items

[[Analytics]] [[Thinking]] [[Creative]] [[Decision-making]] [[Data]]