Overview
The u32 filter allows you to match on any bit field within a packet, so
it is in some ways the most powerful filter provided by the Linux
traffic control engine. It is also the most complex, and by far the
hardest to use. To explain it I will start with a bit
of a tutorial.
Matching
The base operation of the u32 filter is actually very simple.
It extracts a bit field from a 32 bit word in the packet, and if it is
equal to a value supplied by you it has a match. The 32 bit word must
lie at a 32 bit boundry. The syntax in tc is:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32
classid 1:1
match u32 0xc0a80800 0xffffff00 at 12
The first line uses the same syntax shared by all filters, so I will
ignore it for now. The second line just says that if the filter matches
assign the packet to class 1:1. The third line is the interesting one;
this is what it means:
match u32 This keyword introduces a match condition. The u32
is the type of match. It must be followed by a value
and mask. A u32 match extracts a 32 bit word out of
the header, masks it and compares the result to the
supplied value. This is in fact the only type of
match the kernel can do. Tc "compiles" all other
types of matches into this one.
0xc0a80800 This is the value to compare the masked 32 bit word
to. If it is equal to the masked word the match is
successfull.
0xffffff00 This is the mask. The word extracted from the
packet is bit-wise and'ed with this mask before
comparision.
at 12 This keyword tells the kernel where the 32 bit word
lives in the packet. It is an offset, in bytes,
from the start of the packet. So in this case
we are loading the 32 bit word that is 12 bytes from
the start of the packet. The offset is optional.
If not supplied it defaults to 0 which is generally
not what you want.
Now if you look at rfc791 you will see that the source address is stored
at offset 12 in an IP packet. So the match condition could be read as:
"match if the packet was sent from the network 192.168.8.0/24". To use
the u32 filter you do have to be familiar
with the fields in IP and TCP, UDP and ICMP headers. But you don't have
to remember the offsets of the individual fields - tc has some syntatic
sugar for that. This command has does the same thing as the one above.
The syntax is different, but the filter submitted
to the kernel is identical:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32
classid 1:1
match ip src 192.168.8.0/24
A u32 filter item can logically "and" several matches together,
succeeding if only if all matches succeed. This example will succeed
only if the packet was sent from network 192.168.8.0/24, and has a TOS
of 10 hex:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32
classid 1:1
match ip src 192.168.8.0/24
match ip tos 0x10 1e
You can have as many match conditions on the one line as you want. All must be successful for the filter item to score a match.
If you enter several tc filter commands the filters are tried in turn until one matches. For example:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32
classid 1:1
match ip src 192.168.8.0/24
match ip tos 0x10 1e
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32
classid 1:2
match ip src 192.168.4.0/24
match ip tos 0x08 1e
The first filter item checks if the packet is from network 192.16.8.0/24
and has a TOS of 10 hex. If so the packet is assigned to class 1:1. If
not the second filter item is tried. It checks if the packet is from
network 192.168.0.4/24 and has a TOS of 08 hex
and if so it will assign to packet to class 1:2. If not the next filier
item would be tried. But there is none, so u32 filter fails to classify
the packet.
Now it is time to discuss u32 handles. A u32 handle is actually 3
numbers, written like this: 800:0:3. They are all in hex. For now we are
only interested in the last one. This last number identifies the filter
items we have been adding. Because we did not
specify an number generated for the filter item the kernel allocated
one for us. In fact it allocated the handles 800:0:800 and 800:0:801.
The handle it generates is one bigger than the largest handle used so
far, with a minum value of 800 hex. Valid filter
item handles range from 1 to ffe hex. Like all filter handles, the
complete handle (as in 800:0:801) must be unique. We can force a
particular handle to be used for a filier item by using the "handle"
option of "tc filter", like this:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32
classid 1:1
match ip src 192.168.8.0/24
match ip tos 0x10 1e
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 handle ::1 u32
classid 1:2
match ip src 192.168.4.0/24
match ip tos 0x08 1e
These tc commands are almost identical to the previous example. In fact
the tc command creating the first item is identical, so it will be
allocated the same handle as before, 800:0:800. The second command only
differs from the previous example in that it specifies
item handle 1 is to be used. (The rest of the numbers in the handle are
not specified, so the defaults are used.) The full handle created for
the second filter item will be 800:0:1. The kernel evaluates filter
items in handle order, with lower handle numbers
being checked first. So the impact of doing this will be to reverse the
order they two filter items are evaluated by the kernel, compared to
the previous example.
Linking
Before proceeding we need a new concept. In effect filter items that
share the same prefix in their handle (800:0 in the above examples) form
a numbered list. The number is the filter item number, ie the last
number is the handle. In the last example above
we had a two item list with these handles:
list 800:0:
1 [src=192.168.4.0/24, tos=0x08] -> return classid 1:2
800 [src=192.168.8.0/24, tos=0x10] -> return classid 1:1
I will call this a u32 filter list, or just a filter list for short. The
prefix (800:0 in this case) can be used as a handle to identify the
list. In the section above I described how the kernel "executes" such a
list. To recap it does this by running through
the list in filter item number order, checking each filter item in turn
to see if it matches. If a filter item matches it can classify the
packet, in which case the u32 filter stops and returns the classified
packet. But when a u32 filter item matches a packet
there is one other thing it can do besides classifing the packet. It
can "link" to another u32 filter list. For example:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32
link 1:0:
match ip src 192.168.8.0/24
If this filter item matches it will "link" to filter list 1:0:, meaning
the kernel will now execute filter list 1:0:. If a filter item in that
list matches and classifies a packet then the u32 stops and returns
classified packet. If that does not happen, ie
if no filter item in the list classifies the packet then the kernel
resumes executing the original list. Execution continues at the next
filter item in the original list, ie the one after the filter item that
did the "link". A linked list can in turn link
to other lists. You can nest up to 7 link commands.
If you specify a "link" command for a filter item any attempt to
classify a packet in the same filter item will be ignored. Another way
of saying this is the "classid" option and its aliases won't work if you
put "link" on the command line.
This linking is not in itself very useful. It is usually faster to use
one big list, and it always easier to do it that way. But there are two
commands you can combine with the "link" command, and in fact neither
can be used without it.
Hashing
The filter lists we have been discussing are actually part of much
larger structures called hash tables. A hash table is just an an array
of things that I will call buckets. A bucket contains one thing: a
filter list. This will all become clear shortly, I hope.
We can now look at the meanings of the other two numbers in a u32 filter
handle. One handle in the examples above was 800:0:1. Well, the 800
identifies the hash table, and the 0 is the bucket within that hash
table. So 10:20:30 means: filter item 30, which
is located in bucket 20, which is located in hash table 10.
Hash table 800 is special. It is called the root. When you create a u32
filter the root hash table gets created for you automaically. It always
has exactly one bucket, numbered 0. This means the root hash table also
exactly one filter list associated with it.
When the u32 filter starts execution it always executes this filter
list. In other words a u32 filter does its thing by executing filter
list 800:0. If filter list 800:0 does not classify the packet (implying
that none of the lists it linked to clasified it
either) then the u32 filter returns the packet unclassified.
Not unsurprisingly you can't delete the root hash table. Actually you
can't delete any other hash table either (as of 2.4.9), but that is
because of a bug in the in the kernel u32 filter's reference counting
code. The only way to get rid of a hash table in
2.4.9 or earlier is to delete the entire u32 filter.
Hash tables other than the root must be created before you can add
filter items that link to them. Use this tc command to create hash
table:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 handle 1: u32
divisor 256
This creates a hash table with 256 buckets. The buckets are numbered 0
through to 255. So we have effectively created 256 filter lists with
handles 1:0, 1:1, ... 1:255. A hash table can have 1, 2, 4, 8, 16, 32,
64, 128 or 256 buckets. Other values are possible
but can be very inefficient. The kernel has a bug that will allow you
to have 257 buckets, but doing that may cause an oops.
If you omit the "handle" option the kernel will allocate you a new
handle. Currently (2.4.9) the kernel has a bug - the handle allocation
routing will go infinite rather than return failure in the very unlikely
circumstance that all hash table handles are in
use.
The way the tc "link" option is written it might appear that you can
link to any bucket. You can't. The link option only allows you to
specify bucket 0 (implying that "link 1:1" is illegal). To select a
bucket other than 0 you must use the "hashkey" option:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32
link 1: hashkey mask ffffff00 at 12
match ip src 192.168.8.0/24
The hashkey option causes the kernel to calculate the bucket number of
the filter list to link to from data in the packet. You get to specify
what data. This operation is usually called a hashing. In this case the
hash it is a particularly fast but primitive
one. In the example above this is what happens, in detail, if the match
succeeds:
The kernel reads the 32 bit word at offset 12 of the packet being sent.
This word is masked with ffffff00
The word is left shifted by 8 bits, and them masked with 0xff. The
amount of left shift is calculated from the mask - it is the number of
bits the mask has to be shifted so the first 1 bit appears in the least
significant bit. The is the "hashing function".
It changed between 2.4 and 2.6. In 2.4 the 4 bytes in the word are
xor'ed together. From what I have seen, the 2.4 version did a better job
on real data.
The result of the hash, which is a number in the range 0..255, is then masked with (number of buckets - 1).
The result is a bucket number, which is then combined with the hash table in the link option to form a filter list handle.
That filter list is then executed.
If you look at rfc791 you will see the hash in the example is selecting
the senders network address. Tc offers no syntatic sugar to help you
this time, ie there is no "hashkey ip src 0.0.0.0/24" or similar. You to
do it the hard way and look up the rfc's.
Why would you hash on the source network rather than testing for it in a
match option? Its only useful if you want to classify a packet based on
a lot of different source networks. If there is only one or two source
networks you are better off using match as
doing a couple of matches is faster than doing a hash. But, the amount
of time required to test all the matches will grow as number of source
networks grows. Hashing on the other hand takes a fixed amount of time
regardless whether there is 1 or 100 source
networks, so if there are thousand's of source networks hashing is
going to be literally 100's of times faster than testing them one by one
using matching.
I mention this because there is an example from Alexey's
"README.iproute2+tc" that selected the TCP protocol (among others) using
hashing. As an example of how to use hashing it is good, but it has
been cut and pasted by every man + dog, altered to only select
the TCP protocol, and then quoted as the way to do it. Wrong. A simple
"link" without hashing would be better in that case.
We have dealt with one side of hashing - how the filter list to be
executed (hash table, bucket) is selected. There is a second side to it -
adding items to the selected filter list. The problem is really quite
simple - which one is it? You know the hash table
number, it is the bucket number that is the problem. You could use the
description of the hashing algroithm above and manually calculate the
bucket number. That is a bad option for two reasons. Firstly, its hard
work in the general case. Secondly its fragile,
because the hashing algroithm in the kernel can and has changed. Tc can
calculate the hash for you, and it is better and easier to let it do
so. Letting Tc do this does not effect time it takes the kernel to
execute the filter. Here is how you do it:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32
classid 1:2
ht 1: sample ip src 192.168.8.0/8
match ip src 192.168.8.0/8
match ip tos 0x08 1e
Caveat: as of 07/Feb/2006, the hashing algorithm in tc is still at the
2.4 version(!). Ergo, for 2.6 tc ends up with the wrong answer, so this
example above won't work until this bug is fixed.
The line in question is the third one - the rest we have seen before.
The "ht 1:" says the filter item is to be inserted into hash table 1::.
The "sample ..." says what value we want to calculate has bucket number
for, ie we want that value to be hashed. Tc
will apply the same hashing algorithm used by the kernel to calculate
the bucket number. The "..." can be anything that could legally follow a
"match" option, so all the syntatic sugar for calculating IP offsets is
available to you.
There is, unfortunately, three bugs in the current version of "tc" (ie
tc up until cvs 2006-02-09), which render "sample" useless. Firstly,
"sample" assumes the target hash table has 256 buckets. If it doesn't,
you are out of luck - you must use the "ht" option
instead. Secondly, the "sample" option always uses the 2.4 kernel
hashing function. Ie, it doesn't work on 2.6 kernels. Finally, the
"sample" parsing code in "tc" has a bug (a missing memset()), which
causes tc to get segmentation violations. This last bug
renders it completely useless.
Now for some random points. First of all, why did I not use the "handle"
option of tc to specify the hash table, as is done everywhere else?
Answer: because you must give the "ht" option. You can also give the
"handle" option, but if you do the hash table number
in it must be blank, (as in ::1), or be equal to the hash table given
in the "ht" option. Is there a good reason why tc and the kernel work
like this? No, not that I can see.
Secondly, why is the fourth line required in the command? First of all,
perhaps isn't obvious why it may not be required. It may not be needed
because the "sample" option has already selected the filter list for
this source network. If no other source networks
hash this this same bucket there is indeed no reason to for the match
command. But if several source networks hash to the same bucket it is
required - the filter won't work without it. If you are hashing for a
good reason, ie to speed up the process of selecting
among many possibilities, and you are being conservative, ie you assume
you don't know the internals of the hashing algroithm, then you can
never be sure that each bucket will only have one filter item. So this
match line should always be present.
Thirdly, there are many examples on the net that hash on the IP
protocol, then selects protocol 6 directly using "ht 1:6:" rather than
using the "sample" option. Should I copy that? Answer: No. This example
should sound familiar. It is the same cut & paste
(aka hack, because they always try to improve the example) from
Alexey's "README.iproute2+tc" file I referred to earlier. In that
example Alexey assumed he knew how the hashing algroithm worked. It
probably sounded like a reasonable assumption to him - he
designed and coded the algorithm. But it is not a good assumption for
the rest of us. He did this because under the current hashing algorithm
the value is trival to calculate under some circumstances. If you are
selecting one byte from the packet on a byte
boundry, and use a hash table 256 elements long, then the byte always
hashes to itself. The IP protocol byte meets those conditions.
Fourthly, should I allocate my own handles to filter items in a hash
bucket? Answer: Avoid it if possible. You can manually allocate filter
item numbers using the handle option, as in "handle ::1". If you do so
be sure to allocate a unique filter item number
to each filter item in the hash table (as opposed to unique to just the
bucket the filter item lives it). You have to do this if you assume (as
you should) that you don't know what bucket the filter item is going to
hash to. But, as I said earlier, avoid it
if possible. You would not be hashing if there weren't a lot of filter
items to choose from. And if there are a lot of them doing your own
filter item numbering will be painful.
Header Offsets
The IP header (and other headers) are variable length. This creates a
problem if you are trying to use "match" to look at a value in a header
that follows - you don't know where it is. It is not an impossible
problem because every header in an IP packet contains
a length field. The "header offsets" feature of u32 allows you to
extract that length from the packet, and then add it to the offset
specified in the "match" option.
Here is how it works. Recall that the match option looks like this:
match u32 VALUE MASK at OFFSET
I said earlier that OFFSET tells the kernel which word in the packet to
compare to VALUE. That statement was a simplification. Two other values
can be added to OFFSET to determine which word to use. Both those values
start off as 0, but they can be modified
when a "link" option calls another filter list. Any modification made
only applies while called filter list is being executed as the old
values are restored if the called filter list fails to classify the
packet. Here are the two values and the names I call
them:
permoff This value is unconditionally added to every OFFSET
that is done in the destination link, ie that one
that is called. This includes calculations of new
permoff's and tempoff's. Permoff's are cumulative
in that if the destination link calls another link
and calculates a new permoff, the result is added to
this one.
tempoff A "match" option in the destintaion link can optionally
add this value its OFFSET. Tempoff's are temporary, in
that it does not apply to any links the destination link
calls. It also does not effect the calculation of
OFFSET's for new permoff's and tempoff's.
Time for an example. Consider this command:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32
link 1: offset at 0 mask 0f00 shift 6 plus 0 eat
match ip protocol 6 ff
The match extression selects tcp packets (which is IP protocol 6). If we
have protocol 6 we execute filter 1:0. Now for the rest of it:
offset This signals that we want to modify permoff or tempoff
if the link is executed. If this is not present,
neither permoff nor tempoff are effected - in other
words the target of the link inherits the current
permoff and tempoff.
at 0 This says the 16 bit word that contains the value we
are going to use to calculate permoff or tempoff lives
offset 0 the IP packet - ie at the start of the packet.
This offset must be even. If not specified 0 is used.
mask 0f00 This mask (which is in hex) is bit-wise anded with the
16 bit word extracted from the packet header. It
isolates the header length from the rest of the
information in the word. If not specified 0 is used
for the extracted value.
shift 6 This says the word extracted is to be divided by 32
after being masked. If not present the value is not
shifted.
plus 0 After extracting the word, masking it and dividing it by
32, this value is now added to it. If not present is
assumed to be 0.
eat If this is present we are calculating permoff, and the
result of the calculation above is added to it. Tempoff
is set to 0 in this case. If this is not present we are
calculating tempoff, and the result of the calculation
becomes tempoff's new value. Permoff is not altered in
this case.
If you don't understand this then accept at face value that it does
calculate the position of the second header in an IP packet. Copy &
paste it into your scripts. I am not going to try and explain it
further. You should of course dig out rfc791 and verify
it for yourself. That way you will be able to apply it to headers
beyond the second one.
Having calculated your offset you can now add entries to the destination
filter list that depend on it. Here is an example entry:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32
classid 1:4
ht 1:0
match u32 0x140000 ffff0000 at nexthdr+0
We have see almost all of this before. "ht 1:0" inserts this filter item
into hash table 1, bucket 0. "classid 1:4" classifies the packet if the
filter matches. The "match" selects protocol 14 hex (which is 20
decimal - ftp). The "at nexthdr+0" is the only
new bit, or at least the "nexthdr+" is new. The "0" sort of means the
same thing as it always did - that the 32 bit word that contains the TCP
port is at offset 0. But it is offset 0 from the TCP header, because
either permoff of tempoff has been set to point
to that header. As for "nexthdr+", recall that adding "tempoff" was
optional. If you add "nexthdr+" it gets added. If you don't it doesn't.
Tc does supply syntatic sugar for this as well. I could of written this way, and generated an identical filter item:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32
classid 1:4
ht 1:0
match tcp protocol 6 ff
Recall that I said modification made to permoff and tempoff only applies
while called filter list is being executed as the old values are
restored if the called filter list fails to classify the packet. This
was a lie. Permoff is restored, but tempoff isn't.
This can make for subtle suprises in the way a U32 filter executes,
because you tend to assume that during the execution of a filter list
permoff and tempoff never change. But if you link to another list
tempoff may change. I recommend always using permoff's
(ie, always specify "eat", and never use "nexthdr+") to avoid this.
Reference
Handles
The u32 filter uses 3 numbers for its handle. These numbers are written:
H:B:I, eg 1:1:2. All are in hex. The first number, H, identifies a hash
table. The second number, B, identifies a bucket within the hash table,
and the third number, I, identifies the
filter item within the bucket. The combination must be unique.
Hash table numbers must lie between 001 and fff hex. The traffic control
engine will generate a hash table number for you if you don't supply
one. Generated numbers are 800 or above. The hash table number in the
handle is not used when creating or changing
a hash table item. Instead the hash table specified by the "ht" option
is used, and the hash table in the handle must be not specified, 0, or
equal to the hash table in the "ht" option.
A bucket number can range from 0 to 1 less than then number of buckets
in the parent hash table. If no bucket is specified (as in 1::2), then 0
is assumed.
Filter Item numbers must lie between 001 and fff hex. The traffic
control engine will generate a filter item number for you if you don't
supply one. The generated number is the larger of 800 and one bigger
than the current largest item number in the bucket.
Execution
Each "tc filter add ... u32" item adds either a hash table, or adds a
filter item to a bucket within a hash table. When a u32 filter is
created the root hash table, whose handle is 800::, is automatically
created. It has one bucket. The u32 starts by checking
each filter item in bucket 0 of the root hash table. Filter items
within a bucket are always checked in filter item number order. As soon
as a filter item classifies the packet the u32 filter stops execution.
Filter items may use the "link" option to execute
a filter item list held by a bucket in another hash table.
Options
classid :<classify-spec>: | flowid :<classify-spec>:
If all the match options succeed then this will :classify:
the packet and the u32 filter will stop execution. Ignored
if "link" option is given.
divisor <NUMBER>
If supplied this parameter must appear on its own, without
any other arguments. It creates a new hash table. <NUMBER>
specifies the number of buckets in the hash table. It can
range from 1 to 256, and should be a power of 2. The hash
table number is taken from the handle supplied. If no handle
is supplied a new hash table number is generated.
hashkey mask <MASK> at <AT>
If the link specified by the "link" option is taken then this
option specifies the bucket within the hash table to use. This
is how the bucket number is calculated:
1. The 32 bit work at offset <AT> is read from the packet.
2. The 32 word is masked with <MASK>. <MASK> is in hex.
3. The 4 bytes in the result are xor'ed together.
4. The result is bit-wise anded with the value (number of
buckets in the hash tabled linked to - 1).
ht <HASHTABLE-HANDLE>
This option specifies the handle of a filter item being added
or changed. The filter item in <HASHTABLE-HANDLE> must be
unspecified or 0 - it can only be specified by the "handle"
option. The bucket specified may be overridden by the "sample"
option.
link <HASHTABLE-HANDLE>
If all match options succeed in this filter item the "link"
option causes the filter items in another hash table's bucket to
be checked. If none of the filter items in the linked to bucket
classify the item then u32 filter continues checking filter
items in the current bucket. The "link"ed to bucket may link to
yet another bucket, to a maximum level of 7 such calls (in 2.4.9
.. 2.6.15). The <HASHTABLE-HANDLE> specifies the hash table to
link to. The bucket and filter item numbers in that handle must
both be unspecified or blank. Bucket 0 will be used unless
overridden by the "hashkey" option. The "offset" option can be
used to alter packet offsets in the linked to bucket.
match <selector>
This option checks if a field in the packet has a particular
value. A filter item may contain more than one "match" option.
All match options must be satisified before the filter item
considers it has a match. What the filter item does when it
has a match is specified by the "link", "classid"/"flowid", and
"police" options. If none are specified the filter item does
nothing when it matches. Selectors are described below.
offset mask <MASK> at <AT> shift <SHIFT> plus <PLUS> eat
If the link specified by the "link" option is taken then the
position of the values extracted from the packet by the hash
table linked to will be offset by this specification. This is
how the offset is evaluated & implemented:
1. The 16 bit word at offset <AT> is read from the packet. If
<AT> is not present the 16 bit work is read from offset 0.
2. The 16 bit word is masked with <MASK>. If no masked is
specified 0 is assumed.
3. The masked 16 bit word is divided by (2**<SHIFT>).
4. The resulting value has <PLUS> added to it. If not specified
<PLUS> defaults to 0.
5. If none of <MASK>, <AT>, <SHIFT>, nor <PLUS> are specified
then the current temporary offset is used.
6. If "eat" is specified the offset is permanent, and is added
to the current permanant offset. The permanent offset is
unconditionally added to the <AT> value in "match", "offset"
and "hashkey" options in the hash table linked to, and any
nested links. If "eat" is not specified the offset is
temporary. Temporary offsets any added to the "at
nexthdr+<AT>" values in "match" options, but do not effect
any other <AT> values.
7. If then specified does not classify the packet, and hence
execution resumes at the next filter item, then the permanent
offset calculated here is discarded. The temporary offset,
however, remains in effect.
8. When the u32 filter starts executing both the permanent and
temporary offset are initialised to 0.
police <police-spec>
If all the match options succeed then this will :police: the
packet and the u32 filter will stop execution. Ignored if "link"
option is given.
sample <selector>
This option computes the bucket for the filter item being or
changed from the <selctor> passed. The packet offset and
mask parts of the selector are ignored if given. When
calculating the hash bucket, the divisor in the target hash
bucket is assumed to be 256. There is no way of altering
this. If the divisor isn't 256, use the "ht" option instead.
Selectors are described below.
Selectors
Selectors are used by the match option to extract information from the
packet and compare it to a value. All selectors compile to the one
format which is accepted by the kernel. This format reads a 32 value
from the supplied offset within the packet. The offset
must be on a 32 bit boundary. The value read is bit wise anded with the
supplied mask. If match succeeds if the result is equal to the supplied
value. In C:
if ((*(u32*)((char*)packet+offset+permoff) & mask) == value)
match();
The "permoff" variable in this statement is calculated by the "offset" option that executed this filter list.
Here are some conventions which won't be repeated below for brevity:
at nexthdr+<OFFSET>
Except where noted this can be appended to all selectors to
override the default position of the field in the packet. The
<OFFSET> is the offset within the packet where the field can
be found. If an 16 bit value is being compared the <OFFSET>
should be on a 16 bit boundary, and if a 32 bit value is being
compared if should be on a 32 bit boundary. The <OFFSET> is
given in decimal; prefix with 0x to enter it in hex. If
"nexthdr+" is present any temporary offset calculated by the
"offset" option is added to <OFFSET>. The current permanent
offset calculated by the "offset" optional is unconditionally
added to <OFFSET>. It is unlikely you will want to specify
the "at" option with anthing other than u32, u16 and u8
selectors.
<IP6ADDR>/<CIDR>
This specifies set of up to 4 32 bit masks and values that
will match a 128 bit IPv6 address. The combined values equal
the IPv6 address supplied, which may be in any IPv6 address
format. The combined masks are derived from the <CIDR>
portion - it is a 128 bit word with the upper <CIDR> bits
set to 1's, the rest are 0's. If the HOST is not given the
host is all 1's. The IP address must be numeric.
<IPADDR>/<CIDR>
This specifies a mask and value. The value is equal to the
IPv4 address supplied. The mask is derived from the
<CIDR> portion - it is a 32 bit word with the upper
<CIDR> bits set to 1's, the rest are 0's. If <CIDR>
is not given the mask is all 1's. The IP address must be
numeric. For example, 192.168.10.0/24 would yield a value
of c0a80a00 hex and a mask of ffffff00 hex.
<MASK>
This specifies a mask value the field will be bit wise anded
with before being compared to <VALUE>. It is given in hex.
<VALUE>
This specifies the value the field extracted from the packet
must equal, after being anded with the <MASK>. It is decimal,
unless prefixed with 0x, in which case it is hex. Ie 0x10 and
16 both mean the same thing.
Here are the selectors that can follow a "match" or "sample" option:
icmp code <VALUE> <MASK>
Match the 8 bit code field an the icmp packet. This must
be in a hash table that is "link"ed to by a filter item which
contains an "offset" option that skips the IP header.
icmp type <VALUE> <MASK>
Match the 8 bit type field an the icmp packet. This must be
in a hash table that is "link"ed to by a filter item which
contains an "offset" option that skips the IP header.
ip df
Matches if the IPv4 packet has the "don't fragment" bit set.
May not be followed by an "at" option.
ip dport <VALUE> <MASK>
Matches the 16 bit desination port in a tcp or udp IPv4 packet.
This only works if the ip header contains no options. Use the
"link" and "match tcp dst" or "match udp dst" option if you can
not be sure of that.
ip dst <IPADDR>/<CIDR>
Matches the destination IP address of an IPv4 packet.
ip firstfrag
Matches is this IPv4 packet is not fragmented, or it the first
first fragment.
ip icmp_code <VALUE> <MASK>
Matches the 8 bit code field in icmp IPv4 packet. This only
works if the ip header contains no options. Use the "link"
and "match icmp code" options if you can not be sure of that.
ip icmp_type <VALUE> <MASK>
Matches the 8 bit type field in ICMP IPv4 packet. This only
works if the ip header contains no options. Use the "link"
and "match ip icmp" options if you can not be sure of that.
ip ihl <VALUE> <MASK>
Matches the 8 bit ip version + header length byte in the IPv4
header.
ip mf
Matches if the IPv4 packet is there are more fragments from the
same packet to follow this one. May not be followed by an "at"
option.
ip nofrag
Matches if this is not a fragmented IPv4 packet. May not be
followed by an "at" option.
ip protocol <VALUE> <MASK>
Matches the 8 bit protocol byte in the IPv4 header. You can
not use symbolic protocol names (eg "tcp" or "udp").
ip sport <VALUE> <MASK>
Matches the 16 bit source port in a TCP or UDP IPv4 packet.
This only works if the ip header contains no options. Use the
"link" and "match tcp src" or "match udp src" options if you
can not be sure of that.
ip src <IPADDR>/<CIDR>
Matches the source IP address of an IPv4 packet.
ip tos <VALUE> <MASK> | ip precedence <VALUE> <MASK>
Matches the 8 bit TOS byte in the IPv4 header.
ip6 dport <VALUE> <MASK>
Matches the 16 bit desination port in a TCP or UDP IPv6 packet.
This only works if the ip header contains no options. Use the
"link" and "match ip tcp" or "match ip udp" options if you can
not be sure of that.
ip6 dst <IP6ADDR>/<CIDR>
Matches the destination IP address of an IPv6 packet.
ip6 icmp_code <VALUE> <MASK>
Matches the 8 bit code field in ICMP IPv6 packet. This only
works if the ip header contains no options. Use the "link" and
"match icmp" options if you can not be sure of that.
ip6 icmp_type <VALUE> <MASK>
Matches the 8 bit type field in an ICMP IPv4 packet. This only
works if the ip header contains no options. Use the "link" and
"match icmp" options if you can not be sure of that.
ip6 flowlabel <VALUE> <MASK>
Matches the 32 bit flowlabel in the IPv6 header.
ip6 priority <VALUE> <MASK>
Matches the 8 bit priority byte in the IPv6 header.
ip6 protocol <VALUE> <MASK>
Matches the 8 bit protocol byte in the IPv6 header. You can
not use symbolic protocol names (eg "tcp" or "udp").
ip6 sport <VALUE> <MASK>
Matches thw 16 bit source port in a TCP or UDP IPv6. This only
works if the ip header contains no options. Use the "link" and
"match tcp src" or "match udp src" options if you can not be sure
of that.
ip6 src <IPADDR>/<CIDR>
Matches the src IP address in an IPv6 packet.
tcp dst <VALUE> <MASK>
Match the 16 bit destination port in the tcp packet. This must
be in a hash table is "link"ed to by a filter item which contains
an "offset" option that skips the IP header.
tcp src <VALUE> <MASK>
Match the 16 bit source port in the tcp packet. This must be
in a hash table is "link"ed to by a filter item which contains
an "offset" option that skips the IP header.
u16 <VALUE> <MASK>
Match a 16 bit value in the packet. The offset defaults to 0
which is usually not want you want, at append the "at" option
to give the correct value.
u32 <VALUE> <MASK>
Match a 32 bit value in the packet. The offset defaults to 0
which is usually not want you want, at append the "at" option
to give the correct value.
u8 <VALUE> <MASK>
Match a 8 bit value in the packet. The offset defaults to 0
which is usually not want you want, at append the "at" option
to give the correct value.
udp dst <VALUE> <MASK>
Match the 16 bit destination port in the udp packet. This must
be in a hash table is "link"ed to by a filter item which contains
an "offset" option that skips the IP header.
udp src <VALUE> <MASK>
Match the 16 bit source port in the udp packet. This must be
in a hash table is "link"ed to by a filter item which contains
an "offset" option that skips the IP header.