Timothy_Hall
Champion

Custom Sites and RegExp Wildcard Efficiency

For Custom Application/Site object definitions, sk165094 recommends avoiding wildcards like "*" as much as possible, as using them increases the load on the gateway's pattern matcher as measured by fw pm_stats.  However, a very good question came up during the most recent run of my Gateway Performance Optimization class.

Below are two different Custom Application/Site objects that both successfully accomplish the following three goals:

1) Match website shadowpeak.com

2) Match all subdomains of shadowpeak.com (e.g. www.shadowpeak.com, www.shop.shadowpeak.com) even if there is more than one subdomain level.

3) Do NOT match a domain like SCAMshadowpeak.com

The first candidate object does not have the regular expressions checkbox set:

[Screenshot: noregexp.png]

The second candidate object does have the checkbox set, but successfully avoids the use of the "*" wildcard:

[Screenshot: regexp.png]

Both object definitions accomplish the three stated goals. However, the install policy operation takes much longer when the second object, which uses regular expressions, is created; presumably this triggers a recompilation of the entire pattern matcher database.
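For readers who cannot see the screenshots, the three goals can be written down as a small test. The sketch below uses Python's `re` module with a hypothetical pattern, `(^|\.)shadowpeak\.com$`. This is an assumption for illustration only: it is not necessarily the pattern used in either object above, it assumes the matcher is handed a bare hostname, and Check Point's pattern matcher may treat anchors differently than Python does.

```python
import re

# Hypothetical pattern (an assumption, not taken from the screenshots):
# the host is either exactly shadowpeak.com or ends in ".shadowpeak.com".
# No "*" or "+" quantifier is used.
PATTERN = re.compile(r"(^|\.)shadowpeak\.com$", re.IGNORECASE)

def matches_site(hostname: str) -> bool:
    """Return True if hostname is shadowpeak.com or one of its subdomains."""
    return PATTERN.search(hostname) is not None

for host in ("shadowpeak.com", "www.shadowpeak.com",
             "www.shop.shadowpeak.com", "SCAMshadowpeak.com"):
    print(host, matches_site(host))
```

The first three hosts satisfy goals 1 and 2; the last one is rejected because "shadowpeak.com" is preceded by neither the start of the string nor a dot, which is goal 3.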

Two questions:

1) Once the policy is successfully installed on the gateway, which of these objects will do the job most efficiently from a Pattern Matcher CPU consumption perspective?  Part of me thinks the first one will, since the regexp checkbox is not set. The other half of me thinks it would be the second object, since it avoids the use of "*", although the regexp checkbox is now set, which may cause additional overhead of its own.

2) When the SK says to avoid wildcards, obviously "*" is being referred to as matching zero or more characters.  What other ones should be avoided for performance reasons?  I'm assuming regexp constructs that match 0/1 "or more" characters are the ones to avoid, as the "or more" concept requires the pattern matcher to cycle through many different possible combinations.  So would the following list be authoritative as to which regex constructs should be avoided:

*     Matches the previous element zero or more times.
+     Matches the previous element one or more times.
*?   Matches the previous element zero or more times, but as few times as possible.
+?   Matches the previous element one or more times, but as few times as possible.

Like many of my esoteric questions, this will almost certainly have to be answered by R&D so I'm tagging @PhoneBoy 

Edit: I saw the term "greedy matching" in another thread which is what I think we want to avoid here for performance reasons.
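To make the greedy versus lazy terminology concrete, here is a minimal sketch using Python's `re` module on one of the hostnames from the goals above. It illustrates only what the quantifiers match. Python's engine is a backtracking matcher, and nothing in this thread documents how Check Point's pattern matcher is implemented, so this says nothing about the actual CPU cost on a gateway.

```python
import re

text = "www.shop.shadowpeak.com"

# Greedy: ".*" first consumes as much as possible, then gives characters
# back until the trailing "\." can match (the last dot in the string).
greedy = re.search(r".*\.", text).group(0)

# Lazy: ".*?" consumes as little as possible, extending one character at
# a time until the trailing "\." can match (the first dot in the string).
lazy = re.search(r".*?\.", text).group(0)

print(greedy)  # www.shop.shadowpeak.
print(lazy)    # www.
```

Both forms leave the engine free to choose how many characters to consume, which is the "or more" behavior the list above is concerned with.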

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
9 Replies
Marcel_Gramalla
Advisor

A question asked so often and never really answered... hopefully we get some real insights this time. And funnily enough, the CPX 2023 HTTPS Inspection Best Practices presentation even says "Use wildcards - less URLs" while pointing to the SK that says to avoid using them 😅

the_rock
Legend

I know, right? lol

EVERY time I was on a remote session with TAC (regardless of whether it was a T2, T3 or escalation person), we always ended up using wildcards to make this work. Seems to me that what's communicated to customers varies big time : - )

John-Haynes
Participant

When working on a RAD issue a couple of years ago, R&D said that we should really only use regular expressions, due to their performance advantages.

Bob_Zimmerman
Authority

Did you also check whether they match the domain name in the username portion or path of the URL? I've been seeing a lot of spam with well-known domains in the username of the URL lately. How about case? For example:

http://shadowpeak.com@totallynotphishing.info/

http://totallynotphishing.info/shadowpeak.com/

http://ShadowPeak.COM/

I'd love documentation on the input to Check Point's match space. Does it always include the scheme? If a username is specified in the URL, is that included in the input to the match? What about the password? Port number? Is there any normalization (domain names aren't case-sensitive, but paths are)?

From experimentation, I know the answers to some of these. Would still be nice to have real documentation providing official answers to help optimize matching expressions.
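As a point of comparison, here is a minimal sketch of how a generic URL parser (Python's standard `urllib.parse`) splits the three example URLs above into username, hostname, and path. It is only an illustration of where the string "shadowpeak.com" ends up in each case; it makes no claim about what Check Point actually feeds into its matcher.

```python
from urllib.parse import urlsplit

urls = [
    "http://shadowpeak.com@totallynotphishing.info/",
    "http://totallynotphishing.info/shadowpeak.com/",
    "http://ShadowPeak.COM/",
]

def describe(url: str) -> dict:
    """Split a URL into the parts discussed above."""
    parts = urlsplit(url)
    return {
        "username": parts.username,  # text before "@", if any
        "hostname": parts.hostname,  # urlsplit returns this lowercased
        "path": parts.path,          # case is preserved
    }

for url in urls:
    print(describe(url))
```

In the first URL the familiar domain is only the username, in the second it is only a path component, and in the third it is the real host once case is normalized. A pattern evaluated against the whole URL string would see all three as containing "shadowpeak.com" if matching is case-insensitive.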

Danny
Champion

Related Links for Custom Sites/Applications

Recommendations

  • Don't use * as it puts a high load on the Pattern Matcher on the Security Gateway (it doesn't matter if it's with or without Regex)
  • Don't put http: or https: in the string of the custom site
  • Always put a / at the end of non-Regex domains
  • If a specific subdomain can be referenced, such as www.sample.com/, avoid Regex and reference it directly
  • Verify the common name of the custom site and test with this one as well, if it's different

Special considerations

  • Regex syntax implicitly starts and ends with .*
  • Non-Regex syntax implicitly ends with *
  • Custom applications are matched only with the payload of a connection

Risk mitigation

  • Many syntaxes allow more than intended; plan and test your syntax thoughtfully
  • Workarounds might cause performance impacts, though they are always a good read
  • Learn Regex! Verify your Regex syntax with online Regex generators. Understand your Regex!

Common mistakes

  • checkpoint.com matches checkpoint.com.crime.org
  • *checkpoint.com/ matches crime.org/checkpoint.com/
  • *.checkpoint.com/ matches crime.org/www.checkpoint.com/
  • Regex \/checkpoint\.com matches crime.org/checkpoint.com/
  • Regex \.checkpoint\.com matches www.checkpoint.com.crime.org
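The regex over-matches in this list can be reproduced with a short sketch. It uses Python's `re.search`, which scans the whole string and so approximates a regex that implicitly starts and ends with `.*`. The helper `regex_hits` and the tighter pattern `TIGHT` are hypothetical and introduced only for illustration; whether `^` and `$` behave the same way inside a Check Point custom site object is an assumption that would need testing on a gateway.

```python
import re

def regex_hits(pattern: str, text: str) -> bool:
    # re.search looks for the pattern anywhere in the text, similar to a
    # regex with an implicit ".*" on both sides.
    return re.search(pattern, text) is not None

# Unanchored patterns match the look-alike hosts and paths.
print(regex_hits(r"\.checkpoint\.com", "www.checkpoint.com.crime.org"))  # True
print(regex_hits(r"\/checkpoint\.com", "crime.org/checkpoint.com/"))      # True

# Hypothetical tighter pattern: the domain must be preceded by the start
# of the string or a dot, and followed by "/" or the end of the string.
TIGHT = r"(^|\.)checkpoint\.com(/|$)"
print(regex_hits(TIGHT, "www.checkpoint.com/"))           # True
print(regex_hits(TIGHT, "www.checkpoint.com.crime.org"))  # False
print(regex_hits(TIGHT, "crime.org/checkpoint.com/"))     # False
```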
the_rock
Legend

Excellent advice, as always.

Timothy_Hall
Champion

Great tips, thanks Danny; I will work them into the course.  Still hoping someone from R&D can answer my original question, as I couldn't find any easy way to measure pattern matcher CPU utilization while using either version of the ShadowPeak custom object. The fw pm_stats command is supposed to show that information, but its output is not exactly easy reading.

PhoneBoy
Admin

Pretty sure the underlying infrastructure used in both cases is the Pattern Matcher (used by multiple blades).
A properly constructed regex should perform better than a wildcard.

I tend to agree with your list of things to avoid.
Basically, the less precise the regex, the harder the pattern matcher has to work to find a match.

Timothy_Hall
Champion

I received a private reply from the owner of sk165094, who clarified that the authoritative list of wildcards to avoid, if possible, in Custom Site/Application objects is:

  • .*

So it would appear my second example is the preferred method.  I notified this individual of the presence of this thread and invited them to chime in if they like.
