Instagram dataset crawling
Instagram is a mobile-oriented online photo-sharing platform with social network features. We crawled an Instagram subgraph by querying the Instagram users, relationships, media, comments, likes and tags APIs.
To ensure adequate consistency in user relationships as well as topical variety in media properties, our crawling strategy was based on retrieving users who belong to a relatively large “community” (e.g., a thematic channel) in Instagram.
Since Instagram does not offer an explicit group/community feature, we focused our crawling on the Instagram weekend hashtag project (WHP) promoted by Instagram’s official blog.
We selected 73 popular contest tags, whose associated media were chosen as seeds for the crawling process. Our list of selected WHP contest tags is available here. On this page, we will make our Instagram dump available in an anonymized form.
We then built a directed graph based on follower-followee relations among the authors of the seed media, filtering out any user who had no neighbor in the set of media authors.
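The graph construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the crawl: the function name `build_author_graph` and the input format (a collection of author ids plus a list of (follower, followee) pairs) are hypothetical.

```python
def build_author_graph(media_authors, follows):
    """Build a directed follower -> followee graph restricted to seed-media authors.

    media_authors: iterable of user ids that authored at least one seed media item.
    follows: iterable of (follower, followee) pairs (assumed input format).
    Returns (nodes, successors), where successors maps a user to the set of
    authors that the user follows.
    """
    authors = set(media_authors)
    successors = {}
    nodes = set()
    for follower, followee in follows:
        # Keep an edge only if both endpoints are media authors;
        # self-loops are ignored because they do not define a neighbor.
        if follower in authors and followee in authors and follower != followee:
            successors.setdefault(follower, set()).add(followee)
            nodes.update((follower, followee))
    # Authors that never appear in a kept edge have no neighbor among the
    # media authors, so they are left out of the node set.
    return nodes, successors
```

For example, with authors `a`, `b`, `c` and the pairs `(a, b)` and `(c, x)`, only `a` and `b` are kept, since `c` has no neighbor among the media authors.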
Fuzzy c-means clustering setting
For fuzzy c-means clustering, the fuzzifier (m) and the number of clusters (c) have to be chosen in advance. The fuzzifier can be tuned to prevent the detection of clusters in random data; for this purpose, we resorted to the fuzzifier estimation function proposed in [SchwammleJ10], which works in an unsupervised fashion by taking into account how local properties of the original dataset structure vary in randomized datasets.
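As an illustration of the fuzzifier estimation, the sketch below evaluates a closed-form expression in the number of objects N and the number of dimensions D. The coefficients are reproduced from memory of the formula associated with [SchwammleJ10] and should be checked against the paper before use; the function name `estimate_fuzzifier` is hypothetical.

```python
import math


def estimate_fuzzifier(n_objects, n_dims):
    """Estimate the fuzzifier m from dataset size only.

    n_objects: number of objects N (e.g., time series) to be clustered.
    n_dims: number of dimensions D (e.g., time points) per object.
    The coefficients are an assumption to be verified against [SchwammleJ10].
    """
    n, d = float(n_objects), float(n_dims)
    return (
        1.0
        + (1418.0 / n + 22.05) * d ** -2
        + (12.33 / n + 0.243) * d ** (-0.0406 * math.log(n) - 0.1134)
    )
```

The estimate is always greater than 1 and gets smaller as the number of dimensions grows, so low-dimensional data (such as short time series) receive a larger fuzzifier.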
Setting an optimal number of clusters is usually challenging, especially for short time series and overlapping clusters. We therefore tested a range of values of c and monitored an internal cluster validity criterion; as suggested in [SchwammleJ10], the minimum distance between cluster centroids (Dmin) can be an effective criterion, since it is expected to decrease more slowly once an “optimal” c has been reached. Indeed, in all our experiments, the clear cluster structure that emerged from visual inspection was consistently confirmed by both the minimization of the global overlap in the clustering and a functional analysis of Dmin.
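The scan over c can be sketched as follows. This is a minimal NumPy implementation of standard fuzzy c-means followed by a Dmin computation, given only to make the procedure concrete; it is not the code used in our experiments, and the function names, the random initialization and the stopping tolerance are assumptions. Dmin is taken here as the minimum Euclidean distance between centroids.

```python
import numpy as np


def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Standard fuzzy c-means; returns (centroids, membership matrix U)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships, normalized so that each row sums to 1.
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centroids are membership-weighted means of the data points.
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # Squared distances between every point and every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)  # avoid division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centroids, U


def min_centroid_distance(centroids):
    """Dmin: smallest pairwise Euclidean distance between centroids."""
    diffs = centroids[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    upper = np.triu_indices(len(centroids), k=1)
    return dist[upper].min()


def scan_dmin(X, c_values, m=2.0):
    """Run fuzzy c-means for each c and record the resulting Dmin."""
    return {c: min_centroid_distance(fuzzy_c_means(X, c, m)[0]) for c in c_values}
```

Plotting the values returned by `scan_dmin` against c and looking for the point after which Dmin decreases more slowly gives a candidate for the number of clusters, to be cross-checked against visual inspection as described above.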
[SchwammleJ10] V. Schwämmle and O. N. Jensen, “A simple and fast method to determine the parameters for fuzzy c-means cluster analysis”, Bioinformatics, vol. 26, no. 22, pp. 2841–2848, 2010.