Once again, it’s time for some number crunching fun… Today, I looked at the interrelationship of first names within Twitter’s Follower graph and got some beautiful results.
For the analysis, I used an excerpt of the Follower graph, consisting of 1,486,403 users and 72,590,619 links (as described here), as well as the name co-occurrence graph based on the English Wikipedia corpus, which is used for calculating name similarities in Nameling (as described in the Nameling papers). The first names of Twitter users were extracted from the users’ profile data, where a user may provide her or his full name. Of course, many users just entered some fantasy name. To deal with this, the first token of the provided name string that matched our list of known names was chosen as the user’s first name. This process introduces some noise into the data, but due to the vast number of considered pairs of users, this effect should be negligible.
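The extraction step can be sketched roughly as follows. This is only a minimal illustration, not the code actually used: the function name `extract_first_name`, the lowercasing, the letter-based tokenization and the toy `known_names` set are all assumptions made for the example.

```python
import re

def extract_first_name(display_name, known_names):
    """Return the first token of display_name that is a known first name, else None."""
    # Assumption: tokens are maximal runs of letters, compared case-insensitively.
    tokens = re.findall(r"[^\W\d_]+", display_name.lower())
    for token in tokens:
        if token in known_names:
            return token
    # No token matched the list, e.g. for a pure fantasy name.
    return None

# Toy stand-in for the real list of known names.
known_names = {"anna", "john", "maria"}

print(extract_first_name("Dr. John Smith", known_names))
print(extract_first_name("xX_DarkLord_Xx", known_names))
```

Users for whom no token matches simply get no first name and drop out of the pairwise comparison; a fantasy name that happens to contain a known name is exactly the kind of noise mentioned above.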
Now, relative to 3,078 randomly chosen users (our Linux cluster is still crunching on more), I calculated the average name similarity of direct neighbours in the Follower graph, the average name similarity between pairs of users at a (shortest path) distance of two, …of three, and so on. For reference, I also added the total average name similarity for all considered pairs of users, as depicted by the grey dashed line. Finally, the error bars correspond to the 95% confidence interval.
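For illustration, the computation can be sketched like this. It is a simplified sketch under several assumptions rather than the code running on the cluster: the graph is a plain adjacency dict, `toy_similarity` is a placeholder for the Nameling similarity, and the 95% confidence interval uses a normal approximation (the post does not state how the interval was actually computed).

```python
from collections import defaultdict, deque
from math import sqrt
from statistics import mean, stdev

def bfs_distances(graph, source):
    """Shortest-path distance (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def similarity_by_distance(graph, first_name, similarity, seeds):
    """Collect name similarities of (seed, other) pairs, grouped by distance."""
    buckets = defaultdict(list)
    for seed in seeds:
        if seed not in first_name:
            continue  # no first name could be extracted for this seed
        for other, d in bfs_distances(graph, seed).items():
            if d > 0 and other in first_name:
                buckets[d].append(similarity(first_name[seed], first_name[other]))
    return buckets

def mean_with_ci(values, z=1.96):
    """Mean and half-width of a normal-approximation 95% confidence interval."""
    half = z * stdev(values) / sqrt(len(values)) if len(values) > 1 else 0.0
    return mean(values), half

# Toy data; the real input is the Follower graph and the Nameling similarity.
graph = {"a": ["b"], "b": ["c"], "c": []}
first_name = {"a": "anna", "b": "anne", "c": "john"}

def toy_similarity(x, y):
    # Placeholder only: 1.0 if both names share their first letter, else 0.0.
    return 1.0 if x[0] == y[0] else 0.0

buckets = similarity_by_distance(graph, first_name, toy_similarity, seeds=["a"])
for d in sorted(buckets):
    print(d, mean_with_ci(buckets[d]))

# Overall average over all considered pairs (the reference line in the plot).
print(mean(v for values in buckets.values() for v in values))
```

Running one breadth-first search per sampled user keeps the cost proportional to the number of samples times the graph size, which is why working relative to a random sample of users is attractive on a graph with tens of millions of links.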
As we can see, users who are located closer to each other within the Follower graph tend to have more similar names than distant users. Additionally, the average name similarity decreases monotonically with the shortest path distance in the Follower graph. Moreover, users at a distance of up to three tend to have more similar names than average, whereas users at shortest path distances above three tend to have less similar names than average.
Stay tuned for more results (e.g. considering the ReTweet graph and ReTweet frequency) and happy number crunching!