Python meets Instagram — Applied Network Analytics + case study

Juan Felipe Alvarez Jaramillo
Nov 13, 2020

First it was Facebook: the original online tool that connected people and enabled them to build communities. Then our online communication patterns became more photo-centric and Instagram came to dominate the social media scene, accompanied by several other players trying to ride the wave of the next big thing. Now that the content we prefer to consume is changing again, short videos are shaping the future of online networking tools.

Even though competition for active users is getting more intense, Instagram seems to be doing alright. Estimates for 2019 show that the platform had at least 1 billion monthly active users, a figure that has surely gone up since, accelerated by the extraordinary circumstances of 2020.

Source: Our World in Data, Statista, TNW

Given that Instagram still holds, and seems able to retain, the crown of the “king of social” for the foreseeable future, I have decided to write this blog about network analytics for Instagram. If you have seen some of my previous blogs, you already know that one of the analyses that really excites me is the one related to network theory. In this blog I will show you how I framed the problem, used Python to obtain the necessary data, and then created a network object ready to be analysed. I hope you find this useful and are able to replicate my method for your own explorations.

Framing the analysis

A network represents the different ways entities are related to each other. A very simple way of scoping the problem in the context of social media is to look at the accounts people follow. For example, if you follow my profile, we can say that a connection exists between two nodes of the network (we could be more specific and determine whether the connection is uni- or bi-directional, but I chose to keep it simple for this exercise). The universe of social media users is vast, so we also need to narrow the scope down to a smaller subset, for example by limiting the set of accounts under study to those that already follow one given account. That makes the problem easier to handle.

So, in order to build a network like the one described above, you will need two things: 1) a list of all your followers and 2) a list of the accounts that each of your followers follows. Once you have that, you will be ready to start the analysis.
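To make the framing concrete, here is a minimal sketch in plain Python. It assumes the two inputs are already available as a list and a dictionary; the function name `edges_within_scope` and the account names are hypothetical and only serve as an illustration.

```python
def edges_within_scope(my_followers, following_by_user):
    """Return undirected edges between accounts that are both in scope."""
    scope = set(my_followers)
    edges = set()
    for source, followed in following_by_user.items():
        if source not in scope:
            continue
        for target in followed:
            # Keep only connections among my followers and ignore self-loops.
            if target in scope and target != source:
                # Sort the pair so A-B and B-A collapse into one undirected edge.
                edges.add(tuple(sorted((source, target))))
    return sorted(edges)


# Made-up account names, for illustration only.
my_followers = ["ana", "ben", "carla"]
following_by_user = {
    "ana": ["ben", "some_celebrity"],
    "ben": ["ana", "carla"],
    "carla": ["some_brand"],
}
edges = edges_within_scope(my_followers, following_by_user)
print(edges)
```

Accounts outside the scope (here `some_celebrity` and `some_brand`) are dropped, which is what keeps the network small enough to handle.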

Getting the data

Instagram and its parent company have recently been under a lot of scrutiny in terms of privacy and have consequently heavily modified the capabilities of their API. Since further changes could still happen, I decided to learn how to collect the data independently, using other open-source tools like the Python library Instapy.

Instapy is a tool that automates repetitive tasks using a web driver environment (i.e. a bot that performs repetitive actions using your browser). Of its many functionalities, I will be using one called grab_following, which works like this: given a target account name, Instapy captures a list of all the accounts that the target account follows. With a bit of iteration, you can set up a script that ploughs through the complete list of accounts defined in your scope of study. For the sake of simplicity, my script used a list of three target accounts, but in a real application I would use the complete list of my followers (to get it, you can first use Instapy’s grab_followers function to obtain the list of all the people that follow you).
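A minimal sketch of such a loop might look as follows. It assumes that Instapy's `grab_following` method and `smart_run` context manager behave as described in the library's documentation; the helper `collect_following`, the function `main`, the account names and the credentials are hypothetical placeholders, and the cut-off of 500 is an arbitrary choice.

```python
def collect_following(session, targets, amount=500):
    """Return (account, followed_account) pairs for every target account."""
    rows = []
    for target in targets:
        followed = session.grab_following(
            username=target,
            amount=amount,        # cut-off per account; InstaPy also accepts "full"
            live_match=False,
            store_locally=False,
        )
        rows.extend((target, account) for account in followed)
    return rows


def main():
    # Imported here so the helper above can be read and tested without InstaPy installed.
    import pandas as pd
    from instapy import InstaPy, smart_run

    # Placeholder values: replace with real account names and credentials.
    targets = ["target_account_1", "target_account_2", "target_account_3"]
    session = InstaPy(username="my_username", password="my_password")

    # smart_run logs in, runs the block and closes the browser afterwards.
    with smart_run(session):
        rows = collect_following(session, targets)
    return pd.DataFrame(rows, columns=["source", "target"])
```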

This will output a dataframe with the relationships between the three target accounts and all the accounts that they follow that we can use as input for the next section.

Some thoughts about this step: you can choose a cut-off limit (“amount”) for the number of accounts retrieved, to avoid the bot collecting thousands of accounts that will probably not be used later (after all, you are only interested in the connections among the people that follow you). Also, Instapy uses a smart time-out function that snoozes the bot before it attempts to retrieve a large number of accounts. While this is useful, I would suggest breaking your complete list of followers down into several chunks, so that you do not lose the data already retrieved if your IP is temporarily blocked and the script stops.
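One possible way to do the chunking is sketched below, using only the standard library. The names `chunked`, `collect_in_chunks` and `fetch_following` are hypothetical; `fetch_following` stands for any function that returns the accounts followed by one user, for example a thin wrapper around the Instapy call.

```python
import csv


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_in_chunks(fetch_following, followers, size, csv_path):
    """Fetch followings chunk by chunk, appending each chunk's rows to a CSV."""
    for chunk in chunked(followers, size):
        rows = [
            (user, followed)
            for user in chunk
            for followed in fetch_following(user)
        ]
        # Append mode: rows from earlier chunks survive if a later chunk fails.
        with open(csv_path, "a", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
```

If the script stops halfway, the CSV still holds every completed chunk, and the run can be resumed from the first follower that is missing from the file.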

Visualising and getting insights from the network

Now that the data is at your fingertips (I must mention that all the data retrieved is publicly available and can be consulted by humans and bots alike), we can jump into the fun part. I use another Python library called NetworkX to assemble the network, configure some of its attributes and calculate its basic statistics, and another great library called Netwulf to create stunning visualizations of the network.

With NetworkX I can do some extra analysis, like finding the best community partition and using that output as the group attribute for colour coding the network later. I can also use the VoteRank algorithm to find the most influential nodes (users) in the network of people that follow me.
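The two steps could be sketched as follows on a toy edge list. This is an illustration under stated assumptions, not the exact code behind the results in this post: it uses the Louvain implementation that ships with NetworkX (version 2.8 or later) for the partition, and the function name `build_follower_network` and the node names are made up.

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import louvain_communities


def build_follower_network(edges):
    """Build an undirected graph and tag each node with a community id."""
    graph = nx.Graph()
    graph.add_edges_from(edges)
    # Louvain partition; the seed only makes the run reproducible.
    communities = louvain_communities(graph, seed=42)
    group_of = {
        node: index
        for index, members in enumerate(communities)
        for node in members
    }
    # Store the community id as a "group" node attribute for colouring later.
    nx.set_node_attributes(graph, group_of, "group")
    return graph


# Toy data: two tightly knit groups joined by a single connection.
friends = ["f1", "f2", "f3", "f4"]
colleagues = ["c1", "c2", "c3", "c4"]
edges = (
    list(combinations(friends, 2))
    + list(combinations(colleagues, 2))
    + [("f1", "c1")]
)

graph = build_follower_network(edges)
# VoteRank returns nodes ordered by how influential the voting scheme ranks them.
influential = nx.voterank(graph, number_of_nodes=2)
print(nx.get_node_attributes(graph, "group"))
print(influential)
```

A graph prepared this way can then be handed to Netwulf, for instance with `netwulf.visualize(graph)`, which opens the interactive interface in the browser.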

With Netwulf I am able to set the node size relative to the strength of the node (i.e. its density score, the percentage of all the possible connections in the network that a particular node actually has). It is also a great tool for easily choosing the best colouring scheme and other aspects of the network that affect its final display. This is what the Netwulf interface looks like:

Conclusion

So, here is the final visual of the network I created.

The Louvain method for community detection, which is based on the density of the connections, found 6 possible sub-communities within the total network; these are colour coded in the picture above. It's interesting that, without any input about users' locations, the algorithm is capable of clustering people that live in the same countries based on the connection information alone. It also makes an interesting attempt to separate my different groups of friends, work colleagues and acquaintances.

Some other interesting statistics include:

  • The median degree of connections per follower is 12.
  • The network density is 1.3% (the actual share of all possible connections between followers).
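For reference, both statistics can be derived from the edge list alone. The sketch below is a plain-Python illustration with a hypothetical helper name, `basic_stats`; it assumes each undirected connection appears once in the list and ignores accounts without any connection. The density is the number of edges m divided by the n(n - 1)/2 possible pairs among n nodes.

```python
from collections import Counter
from statistics import median


def basic_stats(edges):
    """Return (median degree, density) for a list of unique undirected edges."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n = len(degree)
    possible_edges = n * (n - 1) / 2
    return median(degree.values()), len(edges) / possible_edges


# Toy chain a-b-c-d: 3 of the 6 possible connections exist.
toy_median, toy_density = basic_stats([("a", "b"), ("b", "c"), ("c", "d")])
print(toy_median, toy_density)
```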

It was an interesting exercise! The trickiest part was getting the data, but once you have it, you can spend quality time interacting with the network and discovering interesting patterns of who follows whom.

Written by Juan Felipe Alvarez Jaramillo

Data and analytics expert, driven by curiosity and fueled by a hacker’s mentality. MSc Business Analytics from Alliance Manchester Business School.
