White Paper - How to Batch Categorize URLs
You can look up the category of a website using Check Point’s URL categorization website (https://www.checkpoint.com/urlcat/main.htm).
Since the site only lets you query one site at a time, checking a long list of sites by hand can take a while. All the commands below must be run from the same directory, since the session cookie is saved to and read from a relative path (./cookie).
1. Create a cookie for your session on the website
This is how you log into the site with curl and store your session cookie in a file (replace email@domain.com and “password” with your UserCenter credentials):
curl_cli -k -v --cookie-jar ./cookie -X POST -d "customer=&g-recaptcha-response=&needCaptcha=false&userName=email@domain.com&password=password&userOr..." https://www.checkpoint.com/urlcat/login.htm
N.B.: If your password contains special characters that bash might misinterpret, you may have to “escape” them with a backslash (“\”).
E.g. Pas$word should be entered as Pas\$word
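Alternatively, single-quoting the -d argument keeps bash from expanding special characters at all; a minimal sketch with placeholder credentials (this only breaks if the password itself contains a single quote):
# Inside single quotes, $ and the other shell metacharacters are passed
# through literally, so no backslash escaping is needed.
curl_cli -k -v --cookie-jar ./cookie -X POST -d 'customer=&g-recaptcha-response=&needCaptcha=false&userName=email@domain.com&password=Pas$word' https://www.checkpoint.com/urlcat/login.htm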
2. Create the list of the websites you want to query
[Expert@yourSMS]# vi sites.txt
www.yahoo.com
www.cnn.com
www.gmail.com
Then make a bash script with this content (categorize.sh):
[Expert@yourSMS]# vi categorize.sh
#!/bin/bash
while read p; do
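# Query the categorization page for $p, flatten the HTML reply to one line,
# then keep only the text between the closing </b> and </p> tags.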
result=$(curl_cli -k -v --cookie ./cookie -X POST -d "action=post&actionType=submitURL&urlCategorization=$p" https://www.checkpoint.com/urlcat/main.htm 2>/dev/null | grep -A4 "Categories:" | tr -d '\n' | grep -oP '(?<=</b>).*?(?=</p>)' | sed 's/^[ \t]*//')
echo $p,$result
sleep 1
done <sites.txt
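Make the script executable before running it:
[Expert@yourSMS]# chmod u+x categorize.sh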
When run, the script returns all the categories that each site is associated with:
[Expert@yourSMS]# ./categorize.sh
01com.com,Computers / Internet
020jbxsgqwpse.changeip.org,Computers / Internet
022btrarqcfuk.changeip.org,Computers / Internet
026kordzsydup.changeip.org,Computers / Internet
1001-love.com,Sex
You can also redirect the script’s output to a file so you can save and send it:
[Expert@yourSMS]# ./categorize.sh >sites.csv
For the full list of White Papers, go here.
Hi,
I'm getting some issues when trying to run this script. I can get the cookie, and the script lists the URLs I put in the txt file, but there is no categorization of the sites, and I get a syntax error on line 14, which is done<file.txt.
Any suggestions on this one?
Br
Kai Magnussen
Hi,
I think it's an extra done < sites.txt.
Hi,
Yes, that was the syntax issue. Now the only remaining issue is that it's actually not categorizing the URLs in the file; it just lists them.
Is there anything else that needs to be done regarding this?
Br
Kai Magnussen
There are several issues with this post that might not have existed last year. I used a Linux host rather than my management server; however, changing curl to curl_cli will allow it to run properly from a management host or gateway.
1.) The login process needs to be updated, since the login URL has changed. Note that the @ in the username is URL-encoded as %40:
curl -k -v --cookie-jar ./cookie -d "usersOriginalURL=&customer=&needCaptcha=false&userName=user%40domain.com&password=ENTERPASSWORDHERE" https://urlcat.checkpoint.com/urlcat/login.htm
2.) I had to adjust the script as well. The grep command in the initial script did not work properly for me, so I had to play around with the sed/grep/tr options to present the output the way I wanted. I also had to adjust the curl options, as the script kept overwriting my cookie file rather than reading from it.
#!/bin/bash
while read p;
do
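# -b reads the saved session cookie without rewriting it (unlike --cookie-jar);
# the grep/tr/sed chain flattens the HTML and keeps only the category text.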
result=$(curl -k -q -b ./cookie -d "action=post&actionType=submitURL&needCaptchaForMiscat=true&urlCategorization=$p&ticketIDToLookFor=" https://urlcat.checkpoint.com/urlcat/main.htm 2>&1 | grep -A4 "Categories:" | tr -d '\n' | tr '\t' ' ' | tr -s ' ' | grep -oP '<\/b>.*<\/p> <d' | sed -e 's/^<\/b> //' -e 's/ <.*$//')
echo $p, $result
sleep 1
done < sites.txt
When run, it outputs as expected:
linux$ ./categories.sh
www.google.com, Search Engines / Portals
www.yahoo.com, Search Engines / Portals
www.checkpoint.com, Computers / Internet
www.amazon.com, Shopping
linux$
Hi,
I tested this script on a Security Gateway but didn't get the expected output. I only changed curl to curl_cli.
I get the output below when running the script. How can we get the associated site category in this case?
[Expert@Test_Gateway:0]# ./categorize.sh
www.yahoo.com,
www.google.com,
Hi,
It seems the script is still not including the categories in the output. I suspect the curl output inside the script is slightly different for you than for me. Is there a way to confirm whether the curl output is as it should be?
I ran this on my R80.40 management successfully. Once you have run the login command, you could run the curl command itself and trim the grep/sed/tr off the end of the command to make sure you're receiving data back correctly.
curl_cli -k -q -b ./cookie -d "action=post&actionType=submitURL&needCaptchaForMiscat=true&urlCategorization=$p&ticketIDToLookFor=" https://urlcat.checkpoint.com/urlcat/main.htm 2>&1 | grep -A4 "Categories:"
Replace $p with an actual hostname:
curl_cli -k -q -b ./cookie -d "action=post&actionType=submitURL&needCaptchaForMiscat=true&urlCategorization=www.google.com&ticketIDToLookFor=" https://urlcat.checkpoint.com/urlcat/main.htm 2>&1 | grep -A4 "Categories:"
You should see something similar to what I have below:
$ curl_cli -k -q -b ./cookie -d "action=post&actionType=submitURL&needCaptchaForMiscat=true&urlCategorization=www.google.com&ticketIDToLookFor=" https://urlcat.checkpoint.com/urlcat/main.htm 2>&1 | grep -A4 "Categories:"
<p><b>Categories:</b>
Search Engines / Portals
</p>
<div id="categoryListDetails">
--
Suggested Categories: null,null <br/>
Comment: null <br/>
I put the login and the query into one quick script, which might make things simpler. It prompts for your UserCenter email/password and for a file with the host names.
#!/bin/bash
echo -n "Enter usercenter email: "
read ucemail
echo -n "Enter usercenter password: "
read -s ucpw
echo
echo -n "Enter file with domain list: "
read filename
echo
echo "Logging into UC"
curl_cli -k -v --cookie-jar ./cookie -d "usersOriginalURL=&customer=&needCaptcha=false&userName=${ucemail}&password=${ucpw}" https://urlcat.checkpoint.com/urlcat/login.htm 2>/dev/null
while read p;
do
result=$(curl_cli -k -q -b ./cookie -d "action=post&actionType=submitURL&needCaptchaForMiscat=true&urlCategorization=$p&ticketIDToLookFor=" https://urlcat.checkpoint.com/urlcat/main.htm 2>&1 | grep -A4 "Categories:" | tr -d '\n' | tr '\t' ' ' | tr -s ' ' | grep -oP '<\/b>.*<\/p> <d' | sed -e 's/^<\/b> //' -e 's/ <.*$//')
echo $p, $result
sleep 1
done < $filename
rm ./cookie
Results:
[Expert@vMgmt01:0]# ./categories.sh
Enter usercenter email: myemail@domain.com
Enter usercenter password:
Enter file with domain list: sites.txt
Logging into UC
www.yahoo.com, Search Engines / Portals
www.google.com, Search Engines / Portals
facebook.com, Social Networking
porn.com, Sex, Pornography
I tried this script on both a Security Gateway and a management server from expert mode, but the response was the same: I only got the URLs as output, not the site categories.
The management server I tried is an MDS running R80.30.
Can you let me know if I am missing something here?
The script is working as expected on R80.30 and R80.40 management servers 🙂
Thanks Masher,
This took a few attempts to get working, but I can confirm this works on Ubuntu 22 LTS.
Steps taken:
1. Create Cookie:
curl -k -v --cookie-jar ./checkpointcookie -d "usersOriginalURL=&customer=&needCaptcha=false&userName=username%40domain.com&password=PasswordHere" https://urlcat.checkpoint.com/urlcat/login.htm
2. Create Script:
#!/bin/bash
while read p;
do
result=$(curl -k -q -b ./checkpointcookie -d "action=post&actionType=submitURL&needCaptchaForMiscat=true&urlCategorization=$p&ticketIDToLookFor=" https://urlcat.checkpoint.com/urlcat/main.htm 2>&1 | grep -A4 "Categories:" | tr -d '\n' | tr '\t' ' ' | tr -s ' ' | grep -oP '<\/b>.*<\/p> <d' | sed -e 's/^<\/b> //' -e 's/ <.*$//')
echo $p, $result
sleep 1
done < sites.txt
3. Make script executable:
chmod u+x FilenameHere
4. Create the site list as sites.txt, with one domain per line
5. Run the script
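For example, assuming the script was saved as categorize.sh (the file names here are illustrative):
chmod u+x categorize.sh
./categorize.sh > sites.csv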
Hi, this script has worked for several years, but now it fails. Has anyone else had problems with it recently?
