Gary Warner
2012-10-20 13:10:24 UTC
My real problem has to do with recording which of a very large number of
anti-virus products agree that a given sample is a member of a given
anti-virus family. The database has millions of samples, with tens of
anti-virus products voting on each sample. I want to ask a query like
"For the malware containing the name 'BadBadVirus' which sample had the
most votes, and which vendors voted for it?" and get results like:
|"BadBadVirus"
V1 V2 V3 V4 V5 V6 V7
Sample 1 - 4 votes 1 0 1 0 0 1 1
Sample 2 - 5 votes 1 0 1 0 1 1 1
Sample 3 - 5 votes 1 0 1 0 1 1 1
total 14 3 3 2 3 3
|
Which might be used to tell me that Vendor 2 and Vendor 4 either don't
know how to detect this malware, or that they name it something different.
++++++++++++++++
I'm going to try to generalize my question slightly while hopefully not
breaking your ability to help me. Assume that I have five voters (Alex,
Bob, Carol, Dave, Ed) who have been asked to look at five photographs
(P1, P2, P3, P4, P5) and decide what the "main subject" of the
photograph is. For our example, we'll just assume they were limited to
"Cat", "Dog", or "Horse".
In my "real" data, the fields are:
MD5, Vendor, Malware_name
but for now, consider . . . the data is in the Table "PhotoVotes" in
this form Photo, Voter, Decision
(1, 'Alex', 'Cat')
(1, 'Bob', 'Dog')
(1, 'Carol', 'Cat')
(1, 'Dave', 'Cat')
(1, 'Ed', 'Cat')
(2, 'Alex', 'Cat')
(2, 'Bob', 'Dog')
(2, 'Carol', 'Cat')
(2, 'Dave', 'Cat')
(2, 'Ed', 'Dog')
(3, 'Alex', 'Horse')
(3, 'Bob', 'Horse')
(3, 'Carol', 'Dog')
(3, 'Dave', 'Horse')
(3, 'Ed', 'Horse')
(4, 'Alex', 'Horse')
(4, 'Bob', 'Horse')
(4, 'Carol', 'Cat')
(4, 'Dave', 'Horse')
(4, 'Ed', 'Horse')
(5, 'Alex', 'Dog')
(5, 'Bob', 'Cat')
(5, 'Carol', 'Cat')
(5, 'Dave', 'Cat')
(5, 'Ed', 'Cat')
The objective is that given a photo topic we are looking for, we'd like
to know how many voters thought that WAS the main point of that photo,
but also list WHICH VOTERS thought that. In the real malware-related
version we're matching with an "ilike" on the string we are looking
for. I was hoping to just "count" (1 or 0) for whether that matched.
Query for: "Cat"
Total Alex Bob Carol Dave Ed
1 - 4 1 0 1 1 1
2 - 3 1 0 1 1 0
3 - 0 0 0 0 0 0
4 - 1 0 0 1 0 0
5 - 4 0 1 1 1 1
- ------------------------------------
total 12 2 1 4 3 2
Query for: "Dog"
Total Alex Bob Carol Dave Ed
1 - 1 0 1 0 0 0
2 - 2 0 1 0 0 1
3 - 1 0 0 1 0 0
4 - 0 0 0 0 0 0
5 - 1 1 0 0 0 0
- ------------------------------------
total 5 1 2 1 0 1
Is that something I can do IN THE DATABASE with the data in the format
that I have it stored?
I'm having difficulty getting a query that does that - although it's
simple enough to dump the data out and then write a program to do that,
I'd really like to be able to do it IN THE DATABASE if I can. If it's
not "query-able" we'll probably end up writing a stored procedure...
Thanks for any suggestions.
anti-virus products agree that a given sample is a member of a given
anti-virus family. The database has millions of samples, with tens of
anti-virus products voting on each sample. I want to ask a query like
"For the malware containing the name 'BadBadVirus' which sample had the
most votes, and which vendors voted for it?" and get results like:
|"BadBadVirus"
V1 V2 V3 V4 V5 V6 V7
Sample 1 - 4 votes 1 0 1 0 0 1 1
Sample 2 - 5 votes 1 0 1 0 1 1 1
Sample 3 - 5 votes 1 0 1 0 1 1 1
total 14 3 3 2 3 3
|
Which might be used to tell me that Vendor 2 and Vendor 4 either don't
know how to detect this malware, or that they name it something different.
++++++++++++++++
I'm going to try to generalize my question slightly while hopefully not
breaking your ability to help me. Assume that I have five voters (Alex,
Bob, Carol, Dave, Ed) who have been asked to look at five photographs
(P1, P2, P3, P4, P5) and decide what the "main subject" of the
photograph is. For our example, we'll just assume they were limited to
"Cat", "Dog", or "Horse".
In my "real" data, the fields are:
MD5, Vendor, Malware_name
but for now, consider . . . the data is in the Table "PhotoVotes" in
this form Photo, Voter, Decision
(1, 'Alex', 'Cat')
(1, 'Bob', 'Dog')
(1, 'Carol', 'Cat')
(1, 'Dave', 'Cat')
(1, 'Ed', 'Cat')
(2, 'Alex', 'Cat')
(2, 'Bob', 'Dog')
(2, 'Carol', 'Cat')
(2, 'Dave', 'Cat')
(2, 'Ed', 'Dog')
(3, 'Alex', 'Horse')
(3, 'Bob', 'Horse')
(3, 'Carol', 'Dog')
(3, 'Dave', 'Horse')
(3, 'Ed', 'Horse')
(4, 'Alex', 'Horse')
(4, 'Bob', 'Horse')
(4, 'Carol', 'Cat')
(4, 'Dave', 'Horse')
(4, 'Ed', 'Horse')
(5, 'Alex', 'Dog')
(5, 'Bob', 'Cat')
(5, 'Carol', 'Cat')
(5, 'Dave', 'Cat')
(5, 'Ed', 'Cat')
The objective is that given a photo topic we are looking for, we'd like
to know how many voters thought that WAS the main point of that photo,
but also list WHICH VOTERS thought that. In the real malware-related
version we're matching with an "ilike" on the string we are looking
for. I was hoping to just "count" (1 or 0) for whether that matched.
Query for: "Cat"
Total Alex Bob Carol Dave Ed
1 - 4 1 0 1 1 1
2 - 3 1 0 1 1 0
3 - 0 0 0 0 0 0
4 - 1 0 0 1 0 0
5 - 4 0 1 1 1 1
- ------------------------------------
total 12 2 1 4 3 2
Query for: "Dog"
Total Alex Bob Carol Dave Ed
1 - 1 0 1 0 0 0
2 - 2 0 1 0 0 1
3 - 1 0 0 1 0 0
4 - 0 0 0 0 0 0
5 - 1 1 0 0 0 0
- ------------------------------------
total 5 1 2 1 0 1
Is that something I can do IN THE DATABASE with the data in the format
that I have it stored?
I'm having difficulty getting a query that does that - although it's
simple enough to dump the data out and then write a program to do that,
I'd really like to be able to do it IN THE DATABASE if I can. If it's
not "query-able" we'll probably end up writing a stored procedure...
Thanks for any suggestions.