Python Forum

Full Version: Classify URLs
Hi there,

I'm starting to learn Python so I can replace Excel and work more efficiently. I have a dataframe with a column called 'Source' which contains a list of URLs. I want to create a new column called 'Category' which will hold the category of each URL. The category will be based on the strings inside the URL.

When I look up this type of URL categorisation, it seems that most people use more difficult and advanced systems. Is it not possible to use an if statement for this purpose?

I have tried this code but it seems that the syntax is not right. Any advice on where to look to find out simple ways to classify URLs?

category = def categories (df) = {
    if (df['Source'].str.contains("/string1",regex=True)): 'Category 1',
    elif (df['Source'].str.contains("/string2",regex=True)): 'Category 2',
    else: other 
}
df['Category']= category
df.head()
Many thanks!
If I may, perhaps a better starting point, at least for us to be able to better help you, would be to define your classification and how your data can be matched to that classification.

As far as syntax goes, I am not 100% sure what you are trying to achieve but here are a few pointers:
  • On line 1, you are trying to assign, I assume, the result of a function to the variable category. If you truly want to assign a function to category, you would need to define the function first and then assign it to category.
  • To my knowledge, curly brackets are never used in Python to enclose statements; they are used to define dictionaries and sets. The proper syntax to declare a function is to end the def statement with a colon, return to a new line and indent the new line.
  • Your else statement returns a variable named other, which I do not see being defined here. Did you intend to return a string instead ? You would be missing quotes if so.
  • Overall, it looks like you want to apply a function to the Source data field, which can be done like so:
def categories(Source):
    if '/string1' in Source: return 'Category 1'
    elif '/string2' in Source: return 'Category 2'
    else: return 'other'

df.loc[:, 'Category'] = df.Source.apply(categories)
I used the 'in' notation but you can use regex as well if that is what you really need.
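If you do go the regex route, a minimal sketch could look like the one below. The function name categories_regex and the pattern suffix (/|$) are my own assumptions for illustration: the suffix makes the keyword match only as a whole path segment, which a plain 'in' test does not do.

```python
import re

def categories_regex(source):
    # re.search returns a match object (truthy) or None (falsy).
    # (/|$) is an assumed refinement: the keyword must be followed by
    # a slash or the end of the URL, so '/string10' is not a match.
    if re.search(r'/string1(/|$)', source):
        return 'Category 1'
    elif re.search(r'/string2(/|$)', source):
        return 'Category 2'
    else:
        return 'other'

# Hypothetical URLs, only to exercise the function
for url in ['https://www.website.com/string1/page',
            'https://www.website.com/string2',
            'https://www.website.com/string10']:
    print(url, '->', categories_regex(url))
```

It plugs into the dataframe the same way as the 'in' version, by passing categories_regex to df.Source.apply.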
Hi boring_accountant,

Many thanks for your help. I'll try to remember that first I need to define the functions, instead of adding it all inside. And I'll try to understand better when to use the brackets.
What I would like with the else statement is that if the URL doesn't contain any of the mentioned strings, it says other.

Well, to answer your question, I want to be able to classify the URLs depending on the keywords they contain. The structure of the URL doesn't match the structure of the website, which makes it difficult to classify the URLs.
Source is the field / column in the dataframe that contains the urls, I'm trying to analyse internal linking.

For example,

https://www.website.com/category1/subcategory1
https://www.website.com/subcategory2
https://www.website.com/subcategory3

My goal is to create one column for the main category each URL belongs to and another column for the subcategory. So I was thinking that if I define a list of keywords, assign it to a variable, and then write an if statement, maybe I could get it done. But I'm just starting with Python and I guess I need to keep trying.

I made small changes to the code when I saw some errors in the console, but it still gives me an error as if the dataframe was not recognised.
def categories(Source):
    if '/string1' in Source: return 'Category 1'
    elif '/string2' in Source: return 'Category 2'
    else: return 'other'
 
df_filtered.loc[:, 'Category'] = df_filtered.Source.apply(categories)
I forgot to mention that I did some filtering and the dataframe is now named df_filtered. But the error says:
NameError: name 'df_filtered' is not defined

Thanks again!!
It would be helpful to see a bit more of your code, especially how you define df_filtered. From the error message you posted it looks quite simply like you didn't define df_filtered. Some possible causes:
  • variable definition comes after the first time the variable is actually used
  • variable definition uses a slightly different name
  • variable definition is in another scope (e.g. variable defined in a function scope and used in the global scope)
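To illustrate the last cause, here is a small sketch of how a name defined inside a function is invisible outside it. The function build_filtered and the list standing in for a dataframe are made up for the example; they are not from your code.

```python
def build_filtered():
    # df_filtered exists only inside this function's scope
    df_filtered = ['row1', 'row2']  # stand-in for a real DataFrame
    return df_filtered

build_filtered()  # the return value is discarded here

try:
    print(df_filtered)  # the name was never assigned in this scope
except NameError as err:
    message = str(err)
    print(message)

# Fix: bind the returned value to a name in the scope where it is used
df_filtered = build_filtered()
print(df_filtered)
```

The same error appears if the cell or line that assigns df_filtered simply has not been run yet in the current session.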
Hi again :)


You talk about variables, but is df_filtered not a dataframe? Which variable are you referring to, the dataframe itself?

What I did so far was only:

Import some of the libraries needed
import pandas as pd
import numpy as np
import re
from IPython.display import display
Import the file
xlsx = pd.ExcelFile("excelfile.xlsx")
Read each of the tabs in the file
df1 = pd.read_excel(xlsx, "Tab1")
df2 = pd.read_excel(xlsx, "Tab2")
df3 = pd.read_excel(xlsx, "Tab3")
Concatenate the three tabs from the excel file and create a dataframe with the data
dataframe = [df1,df2,df3]
df = pd.concat(dataframe, ignore_index=True)
df.head()


Then I'm filtering the data contained in the dataframe
df_filtered = df[(df['Destination'].str.contains("website.com",regex=True)==True)&(df['Source'].str.contains("website.com",regex=True)==True)&(df['Type']== "AHREF")]
df_filtered.head(2)
Then I'm trying to create the categories in the filtered dataframe
def categories(Source):
    if '/string1' in Source: return 'Category 1'
    elif '/string2' in Source: return 'Category 2'
    else: return 'other'
 
df_filtered.loc[:, 'Category'] = df_filtered.Source.apply(categories)
Which course could give me some understanding of all those rules? I'm reading "Python for Data Analysis" but it is hard to remember all of this. I guess I need to keep practising.

Many thanks
Hi again,

A variable is just a name to which you can assign values or other objects. df_filtered is a variable of type pandas.DataFrame.

As to the issue at hand, df_filtered appears to me like it should be defined. I ran a quick test at home and didn't get the error message you are obtaining:
import pandas as pd
# Setting up the mock dataframe
df = pd.DataFrame({
    'Destination': ['website.com/test1', 'www.website.com/test2', 'www.somethingelse.com'], 
    'Source': ['website.com/string1', 'website.com/string2', 'website.com/somethingelse'], 
    'Type': ['AHREF', 'AHREF', 'AHREF']
})

# Simplifying your code to filter the DataFrame
# Note that the change from df[] to df.loc[] is to prevent some
# issues you needn't worry about right now
# If you do want, you can look it up here:  
# http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df_filtered = df.loc[(df.Destination.str.contains("website.com")) & 
    (df.Source.str.contains("website.com")) & 
    (df.Type == "AHREF"), :]

def categories(Source):
    if '/string1' in Source: return 'Category 1'
    elif '/string2' in Source: return 'Category 2'
    else: return 'other'

df_filtered.loc[:, 'Category'] = df_filtered.Source.apply(categories)
print(df_filtered)
Output:
             Destination               Source   Type    Category
0      website.com/test1  website.com/string1  AHREF  Category 1
1  www.website.com/test2  website.com/string2  AHREF  Category 2
Did you run all of your code consecutively, in the order shown in your post? If you copy and paste the code from this post, do you get the same error message?
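Separately, on the view-versus-copy note in the code comments: if adding the column to the filtered frame gives you a SettingWithCopyWarning in your pandas version, one common way around it is to take an explicit copy when filtering. This is only a sketch with made-up data, not your actual file.

```python
import pandas as pd

# Hypothetical data, reduced to the two columns the example needs
df = pd.DataFrame({
    'Source': ['website.com/string1', 'website.com/string2', 'other.com/x'],
    'Type': ['AHREF', 'AHREF', 'IMG'],
})

# .copy() makes df_filtered an independent DataFrame, so adding a
# column to it is never treated as a write into df
df_filtered = df.loc[df.Type == 'AHREF', :].copy()

def categories(Source):
    if '/string1' in Source: return 'Category 1'
    elif '/string2' in Source: return 'Category 2'
    else: return 'other'

df_filtered['Category'] = df_filtered.Source.apply(categories)
print(df_filtered)
print('Category' in df.columns)  # the original frame is left untouched
```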

As for your other question on remembering how to complete some of those tasks, I personally learned by practicing a lot. Follow some tutorials / books, try the code yourself, play with it, modify it, etc. Try to read the documentation of some libraries you find interesting. You'll learn about functions you may need but didn't know existed, or maybe you'll just keep in the back of your head that function xyz exists and can be used in scenario abc.

Cheers
Many thanks!!

I'll try it out again. Maybe the file is more complex and that's the cause, but if it works for you it should be OK. I don't have access today to the computer that has the data, but I'll let you know.

Many thanks for the advice, I'll try to do some more courses until pandas is tattooed in my brain :)

Last question: do you know if I could create a list of keywords, so that if any of the keywords is in the URL then it is Category x? So instead of if "/string1", I could apply a list of keywords... or would that be wrong?

Thanks again for your help!!! Much appreciated!!
Quote: Last question: do you know if I could create a list of keywords, so that if any of the keywords is in the URL then it is Category x? So instead of if "/string1", I could apply a list of keywords... or would that be wrong?
Absolutely. You can do something like this:
if any(item in Source for item in ['/string1', '/abc']): return 'Category 1'
This uses the built-in function any(), which returns True if any of the items passed to it is True. I passed it a generator expression that loops through all the items in the list and indicates whether each one is contained in Source.
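If you end up with several categories, one way to avoid a long if/elif chain is to keep the keyword lists in a dictionary and loop over it. This is a rough sketch: the name CATEGORY_KEYWORDS and the keywords '/abc' and '/xyz' are placeholders I made up, so swap in whatever strings identify your own categories.

```python
# Hypothetical keyword lists, one list per category
CATEGORY_KEYWORDS = {
    'Category 1': ['/string1', '/abc'],
    'Category 2': ['/string2', '/xyz'],
}

def categories(Source):
    # Dicts keep insertion order, so if a URL matches keywords from
    # two categories, the category listed first wins
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in Source for keyword in keywords):
            return category
    return 'other'

# Hypothetical URLs, only to exercise the function
urls = ['website.com/abc/page', 'website.com/string2', 'website.com/contact']
labels = [categories(url) for url in urls]
print(labels)
```

The function keeps the same signature as before, so it can still be passed to df_filtered.Source.apply.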
Many thanks for all your help!!!!