White Paper - How to Batch Categorize URLs


Author

@Sebastien_Rho 

 

You can look up the category of a website using Check Point’s URL categorization website (https://www.checkpoint.com/urlcat/main.htm).

Since the site only lets you query one site at a time, working through a long list by hand is slow. The steps below automate the lookups with curl. All commands must be run from the same folder.

1. Create a cookie for your session on the website

This is how you log in to the site with curl and store your session cookie in a file (replace email@domain.com and “password” with your UserCenter credentials):

curl_cli -k -v --cookie-jar ./cookie -X POST -d "customer=&g-recaptcha-response=&needCaptcha=false&userName=email@domain.com&password=password&userOr..." https://www.checkpoint.com/urlcat/login.htm

 

N.B.: If your password contains characters that are special to bash, you may have to “escape” them with “\”.

            E.g. Pas$word should be entered as Pas\$word
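A quick aside on the quoting itself (this is plain bash behavior, not specific to this site): inside double quotes bash still expands `$`, so an unescaped dollar sign silently truncates the password.

```shell
# Inside double quotes, bash expands $word (normally unset), so the
# password is silently truncated:
printf '%s\n' "Pas$word"     # prints just "Pas" when $word is unset
# Escaping the dollar sign, or using single quotes, keeps it literal:
printf '%s\n' "Pas\$word"    # prints "Pas$word"
printf '%s\n' 'Pas$word'     # prints "Pas$word"
```

Alternatively, curl's `--data-urlencode` option encodes the value for you, which also takes care of characters like `&` that would otherwise break the POST body.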

2. Create the list of the websites you want to query

[Expert@yourSMS]# vi sites.txt
www.yahoo.com
www.cnn.com
www.gmail.com
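If you prefer not to use vi, a here-document creates the same file non-interactively (a sketch; any editor produces the same result):

```shell
# Create sites.txt with one hostname per line, no editor needed:
cat > sites.txt <<'EOF'
www.yahoo.com
www.cnn.com
www.gmail.com
EOF
```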

Then make a bash script with this content (categorize.sh):

[Expert@yourSMS]# vi categorize.sh

#!/bin/bash
# Query each site in sites.txt against the categorization page, re-using
# the session cookie from step 1.  The grep lookarounds extract the text
# between the closing </b> after "Categories:" and the closing </p>.
while read p; do

   result=$(curl_cli -k -v --cookie ./cookie -X POST -d "action=post&actionType=submitURL&urlCategorization=$p" https://www.checkpoint.com/urlcat/main.htm 2>/dev/null | grep -A4 "Categories:" | tr -d '\n' | grep -oP '(?<=Categories:</b>).*?(?=</p>)' | sed 's/^[ \t]*//')

   echo $p,$result

   sleep 1

done <sites.txt
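The extraction pipeline can be tested on its own, without the curl call. The lookaround contents in the posted pattern appear to have been eaten by the forum's HTML rendering, so the pattern below is a reconstruction based on the page output shown further down this thread; it assumes GNU grep with PCRE support (`-P`):

```shell
# Assumed shape of the fragment returned by main.htm (taken from a reply
# further down this thread):
html='<p><b>Categories:</b>
Search Engines / Portals
</p>'

# Flatten newlines, grab the text between </b> and </p>, trim leading
# whitespace:
printf '%s' "$html" \
  | tr -d '\n' \
  | grep -oP '(?<=Categories:</b>).*?(?=</p>)' \
  | sed 's/^[ \t]*//'
# prints: Search Engines / Portals
```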

When run, the script returns all the categories each site is associated with:

 

[Expert@yourSMS]# ./categorize.sh

 

01com.com,Computers / Internet

020jbxsgqwpse.changeip.org,Computers / Internet

022btrarqcfuk.changeip.org,Computers / Internet

026kordzsydup.changeip.org,Computers / Internet

1001-love.com,Sex

 

You can also redirect the script's output to a file to save and share the results.

[Expert@yourSMS]# ./categorize.sh >sites.csv
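If you want to watch progress while saving, `tee` (standard on Gaia and Linux) writes each line to the terminal and the file at the same time:

```shell
# tee duplicates stdin to a file and to stdout, so you can save the CSV
# and still see the lines scroll by.  Demonstrated with a sample line:
printf 'www.yahoo.com,Search Engines / Portals\n' | tee sites.csv
# The same idea applied to the script:  ./categorize.sh | tee sites.csv
```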

 

For the full list of White Papers, go here

10 Replies
Kai_Magnussen
Participant

Hi,

 

I'm getting some issues when trying to run this script. I can get the cookie, and the script lists the URLs I put in the txt file, but there is no categorization of the sites, and I get a syntax error on line 14, which is done<file.txt.

 

Any suggestions on this one?

 

Br

Kai Magnussen

funkylicious
Advisor

Hi,

I think there's an extra done < sites.txt.

Kai_Magnussen
Participant

hi,

 

Yes, that was the syntax issue. Now the only remaining issue is that it's actually not categorizing the URLs in the file; it just lists them.

Is there anything else that needs to be done regarding this?

 

Br

Kai Magnussen

masher
Employee

There are several issues with this post that might not have existed last year. I used a Linux host rather than my management server; however, changing curl to curl_cli will allow it to run properly from a management host or gateway.

1.) The login process needs to be updated since the login URL is different.

 curl -k -v --cookie-jar ./cookie -d "usersOriginalURL=&customer=&needCaptcha=false&userName=user%40domain.com&password=ENTERPASSWORDHERE" https://urlcat.checkpoint.com/urlcat/login.htm

 2.) I had to adjust the script as well. The grep command in the initial script did not work properly for me so I had to play around with the sed/grep/tr options to present it how I wanted. I also had to adjust the options for curl as the script kept overwriting my cookie file rather than using it.

#!/bin/bash
while read p; do
    # Query with the saved cookie, flatten whitespace, then cut the reply
    # down to just the category text between </b> and </p>.
    result=$(curl -k -q -b ./cookie -d "action=post&actionType=submitURL&needCaptchaForMiscat=true&urlCategorization=$p&ticketIDToLookFor=" https://urlcat.checkpoint.com/urlcat/main.htm 2>&1 | grep -A4 "Categories:" | tr -d '\n' | tr '\t' ' ' | tr -s ' ' | grep -oP '<\/b>.*<\/p> <d' | sed -e 's/^<\/b> //' -e 's/ <.*$//')
    echo $p, $result
    sleep 1
done < sites.txt

When run, it outputs as expected:

linux$ ./categories.sh
www.google.com, Search Engines / Portals
www.yahoo.com, Search Engines / Portals
www.checkpoint.com, Computers / Internet
www.amazon.com, Shopping

linux$

 

Nandhakumar
Explorer

Hi,

I tested this script with Security Gateway but I didn't get the expected output. I changed only curl to curl_cli.

I get the output below when running the script. How can we get the associated site category in this case?

[Expert@Test_Gateway:0]# ./categorize.sh
www.yahoo.com,
www.google.com,

Johnny_L
Explorer

Hi,

It seems the script is still not including the categories in the output. I suspect that the curl call inside the script returns slightly different output for you and for me. Is there a way to confirm the curl output is as it should be?

 

 

masher
Employee

I ran this on my R80.40 management successfully.  Once you have run the login command, you can run the curl command by itself, trimming off the grep/sed/tr portion, to make sure you're receiving data back correctly.

curl_cli -k -q -b ./cookie -d "action=post&actionType=submitURL&needCaptchaForMiscat=true&urlCategorization=$p&ticketIDToLookFor=" https://urlcat.checkpoint.com/urlcat/main.htm 2>&1 | grep -A4 "Categories:" 

 Replace the $p with a proper host.

curl_cli -k -q -b ./cookie -d "action=post&actionType=submitURL&needCaptchaForMiscat=true&urlCategorization=www.google.com&ticketIDToLookFor=" https://urlcat.checkpoint.com/urlcat/main.htm 2>&1 | grep -A4 "Categories:" 

You should see something similar to what I have below:

$ curl_cli -k -q -b ./cookie -d "action=post&actionType=submitURL&needCaptchaForMiscat=true&urlCategorization=www.google.com&ticketIDToLookFor=" https://urlcat.checkpoint.com/urlcat/main.htm 2>&1 | grep -A4 "Categories:"
<p><b>Categories:</b>
Search Engines / Portals
</p>
<div id="categoryListDetails">

--
Suggested Categories: null,null <br/>
Comment: null <br/>

 

masher
Employee

I put the login and query all in one quick script that might make it simpler. It prompts for your UserCenter email/password and for a file with the host names.

#!/bin/bash

echo -n "Enter usercenter email: "
read ucemail
echo -n "Enter usercenter password: "
read -s ucpw
echo
echo -n "Enter file with domain list: "
read filename
echo

echo "Logging into UC"
curl_cli -k -v --cookie-jar ./cookie -d "usersOriginalURL=&customer=&needCaptcha=false&userName=${ucemail}&password=${ucpw}" https://urlcat.checkpoint.com/urlcat/login.htm 2>/dev/null

while read p;
do
result=$(curl_cli -k -q -b ./cookie -d "action=post&actionType=submitURL&needCaptchaForMiscat=true&urlCategorization=$p&ticketIDToLookFor=" https://urlcat.checkpoint.com/urlcat/main.htm 2>&1 | grep -A4 "Categories:" | tr -d '\n' | tr '\t' ' ' | tr -s ' ' | grep -oP '<\/b>.*<\/p> <d' | sed -e 's/^<\/b> //' -e 's/ <.*$//')
echo $p, $result
sleep 1
done < $filename
rm ./cookie

 Results:

[Expert@vMgmt01:0]# ./categories.sh
Enter usercenter email: myemail@domain.com
Enter usercenter password:
Enter file with domain list: sites.txt

Logging into UC
www.yahoo.com, Search Engines / Portals
www.google.com, Search Engines / Portals
facebook.com, Social Networking
porn.com, Sex, Pornography

 

Nandhakumar
Explorer

I tried this script on both a Security Gateway and a management server from expert mode, but the response was the same: I only got the URLs as output, not the site category.

The management server I tried is MDS R80.30.

Can you let me know if I am missing something here?

Nandhakumar
Explorer

The script is working as expected on R80.30 and R80.40 management servers 🙂
