Discussion:
"Voting" question?
(too old to reply)
Gary Warner
2012-10-20 13:10:24 UTC
Permalink
My real problem has to do with recording which of a very large number of
anti-virus products agree that a given sample is a member of a given
anti-virus family. The database has millions of samples, with tens of
anti-virus products voting on each sample. I want to ask a query like
"For the malware containing the name 'BadBadVirus' which sample had the
most votes, and which vendors voted for it?" and get results like:

|"BadBadVirus"
V1 V2 V3 V4 V5 V6 V7
Sample 1 - 4 votes 1 0 1 0 0 1 1
Sample 2 - 5 votes 1 0 1 0 1 1 1
Sample 3 - 5 votes 1 0 1 0 1 1 1

total 14 3 3 2 3 3
|

Which might be used to tell me that Vendor 2 and Vendor 4 either don't
know how to detect this malware, or that they name it something different.

++++++++++++++++

I'm going to try to generalize my question slightly while hopefully not
breaking your ability to help me. Assume that I have five voters (Alex,
Bob, Carol, Dave, Ed) who have been asked to look at five photographs
(P1, P2, P3, P4, P5) and decide what the "main subject" of the
photograph is. For our example, we'll just assume they were limited to
"Cat", "Dog", or "Horse".

In my "real" data, the fields are:
MD5, Vendor, Malware_name

but for now, consider . . . the data is in the Table "PhotoVotes" in
this form Photo, Voter, Decision

(1, 'Alex', 'Cat')
(1, 'Bob', 'Dog')
(1, 'Carol', 'Cat')
(1, 'Dave', 'Cat')
(1, 'Ed', 'Cat')
(2, 'Alex', 'Cat')
(2, 'Bob', 'Dog')
(2, 'Carol', 'Cat')
(2, 'Dave', 'Cat')
(2, 'Ed', 'Dog')
(3, 'Alex', 'Horse')
(3, 'Bob', 'Horse')
(3, 'Carol', 'Dog')
(3, 'Dave', 'Horse')
(3, 'Ed', 'Horse')
(4, 'Alex', 'Horse')
(4, 'Bob', 'Horse')
(4, 'Carol', 'Cat')
(4, 'Dave', 'Horse')
(4, 'Ed', 'Horse')
(5, 'Alex', 'Dog')
(5, 'Bob', 'Cat')
(5, 'Carol', 'Cat')
(5, 'Dave', 'Cat')
(5, 'Ed', 'Cat')

The objective is that given a photo topic we are looking for, we'd like
to know how many voters thought that WAS the main point of that photo,
but also list WHICH VOTERS thought that. In the real malware-related
version we're matching with an "ilike" on the string we are looking
for. I was hoping to just "count" (1 or 0) for whether that matched.

Query for: "Cat"
Total Alex Bob Carol Dave Ed
1 - 4 1 0 1 1 1
2 - 3 1 0 1 1 0
3 - 0 0 0 0 0 0
4 - 1 0 0 1 0 0
5 - 4 0 1 1 1 1
- ------------------------------------
total 12 2 1 4 3 2

Query for: "Dog"
Total Alex Bob Carol Dave Ed
1 - 1 0 1 0 0 0
2 - 2 0 1 0 0 1
3 - 1 0 0 1 0 0
4 - 0 0 0 0 0 0
5 - 1 1 0 0 0 0
- ------------------------------------
total 5 1 2 1 0 1

Is that something I can do IN THE DATABASE with the data in the format
that I have it stored?

I'm having difficulty getting a query that does that - although it's
simple enough to dump the data out and then write a program to do that,
I'd really like to be able to do it IN THE DATABASE if I can. If it's
not "query-able" we'll probably end up writing a stored procedure...

Thanks for any suggestions.
Josh Kupershmidt
2012-10-24 05:55:32 UTC
Permalink
On Sat, Oct 20, 2012 at 6:10 AM, Gary Warner <***@askgar.com> wrote:

[snip]
Post by Gary Warner
Query for: "Cat"
Total Alex Bob Carol Dave Ed
1 - 4 1 0 1 1 1
2 - 3 1 0 1 1 0
3 - 0 0 0 0 0 0
4 - 1 0 0 1 0 0
5 - 4 0 1 1 1 1
- ------------------------------------
total 12 2 1 4 3 2
[snip]
Post by Gary Warner
Is that something I can do IN THE DATABASE with the data in the format that
I have it stored?
With a table "PhotoVotes" roughly as you described:

Table "public.PhotoVotes"
Column | Type | Modifiers
-----------+---------+-----------
photo_num | integer |
voter | text |
decision | text |

a query like this would do the trick to generate your table, given
your sample data:

WITH distinct_photos AS (
SELECT DISTINCT(photo_num) FROM "PhotoVotes"
),
totals AS (
SELECT photo_num, COUNT(*) AS total_votes
FROM "PhotoVotes" WHERE decision = 'Cat'
GROUP BY photo_num
),
alex_votes AS (
SELECT photo_num, decision
FROM "PhotoVotes" WHERE voter = 'Alex'
)
SELECT dp.photo_num, COALESCE(totals.total_votes, 0) AS total,
(CASE WHEN alex_votes.decision = 'Cat' THEN 1 ELSE 0 END) AS "Alex"

FROM distinct_photos AS dp
LEFT JOIN totals
ON dp.photo_num = totals.photo_num
LEFT JOIN alex_votes
ON alex_votes.photo_num = dp.photo_num

ORDER BY dp.photo_num ASC;

the generation of columns for "Bob", "Carol", "Dave" and "Ed" would be
done the same way as for the "Alex" column. From your message, it
sounded like you knew in advance all the column names you expected to
be generated -- if that's not the case, you might have to look into
something like crosstab().

Josh
--
Sent via pgsql-novice mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-novice
Loading...