VLOOKUP and INDEX/MATCH equivalences in Pandas
In one of my previous blogs I tried to map the equivalences between Excel and Pandas for creating Pivot Tables. I am writing this new blog to do something similar for a widely used functionality (joining datasets based on common keys).
One of the most common tasks performed in Excel would be the use of VLOOKUP formula to append new columns to a dataset, using the leftmost column as a common key. The generalization of this approach is obtained by combining 2 functions (INDEX and MATCH) in order to be able to append new columns to a dataset, using any column as a common key (regardless of its position in the table). This article explains the advantages of each of them and their uses in Excel.
So, how can we achieve the same using Pandas in Python? It turns out it’s relatively easy. Bear in mind that there are several ways of achieving the same result in Python. I have selected the one that I found is both faster and easier to use: is the pd.merge function. This is what a side-to side-comparison looks like.
I will use a simple example to aid undesrtanding. This same solution can be applied to more complex datasets. Lets suppose you have 2 tables, where the common key is a Persons’s Id.
A very common task would be to expand Table 1 information to include columns from Table 2 and you could achieve this either with VLOOKUP or with INDEX and MATCH combination. Here is an example of the formulas that I would use to append the “City” column.
To do it in pandas, you would:
Here is a quick explanation of the code. Since I want to add an extra column to “table1”, I have used the pandas command merge on this table (table1.merge). Then, the first argument is the name of the second table (table2) where the extra columns I want to add are located. The “on” argument is for specifying the name of the column field to be used as key.
Note: If the tables have different column names, you can specify using the “left_on” and “right_on” parameters.
The “how” argument is also a very important one, because it enables you to do more advanced merges than the simple VLOOKUP. I have used the “left” type of join because I want to use the table1 as a basis and add extra columns to table 2, based on the common fields (just as a VLOOKUP would work). However, pd.merge allows me to other kinds of joins that could be very useful depending on what you want to achieve. This figure depicts the logic behind each kind of join:
Using the “left” kind of join, we get some Nans for the Ids that are present in Table 1, but are not present in Table 2.
There is an extra argument you could add to your code in order to replace thos Nans with a ‘Uknown’ category:
Now, the output should print like this:
I hope this blog is useful to illustrate a practical case. This is a link to the merge function documentation in case you want to learn more about the other parameters the function can take.
If you want to connect and have a conversation about this or any other topic, please feel free to find me on LinkedIn: