strgrp: Add cosine similarity filter
The cosine similarity measure[1] (O(m + n)) contributes a decent runtime reduction when used as a filter prior to execution of more expensive algorithms such as LCS[2] (O(m * n)). A private test set of 3500 strings was used to quantify the improvement. The shape of the test set is described by Python's Pandas module as: >>> frames.apply(len).describe() count 3500.000000 mean 47.454286 std 14.980197 min 10.000000 25% 37.000000 50% 45.000000 75% 61.000000 max 109.000000 dtype: float64 >>> The tests were performed on a lightly loaded Lenovo X201s stocked with a Intel Core i7 L 640 @ 2.13GHz CPU with 3.7 GiB of memory. The test was compiled with GCC 4.9.3: $ gcc --version gcc (Gentoo 4.9.3 p1.0, pie-0.6.2) 4.9.3 Copyright (C) 2015 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Using the test program outlined below, ten runs for input set sizes incrementing in batches of 500 were taken prior to filtering with cosine similarity: 500: 0.61, 0.25, 0.08, 0.07, 0.07, 0.07, 0.09, 0.07, 0.07, 0.07 1000: 0.33, 0.32, 0.34, 0.32, 0.32, 0.33, 0.32, 0.32, 0.34, 0.32 1500: 0.72, 1.53, 0.72, 0.70, 0.72, 0.70, 0.72, 0.71, 1.46, 0.71 2000: 1.22, 1.20, 1.22, 1.98, 1.20, 1.20, 1.22, 1.94, 1.20, 1.20 2500: 1.97, 2.72, 1.94, 2.33, 2.44, 1.94, 2.82, 1.93, 1.94, 2.69 3000: 2.69, 3.41, 2.66, 3.38, 2.67, 3.42, 2.63, 3.44, 2.65, 3.39 3500: 4.18, 3.65, 4.21, 4.21, 3.56, 4.21, 4.16, 3.63, 4.20, 4.13 After adding the cosine similarity filter the runtimes became: 500: 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02 1000: 0.08, 0.07, 0.07, 0.07, 0.08, 0.07, 0.07, 0.08, 0.07, 0.07 1500: 0.16, 0.16, 0.15, 0.16, 0.16, 0.15, 0.15, 0.15, 0.16, 0.16 2000: 0.26, 0.26, 0.25, 0.26, 0.26, 0.26, 0.25, 0.26, 0.26, 0.26 2500: 0.41, 0.41, 0.41, 0.40, 0.42, 0.42, 0.42, 0.41, 0.41, 0.41 3000: 0.58, 0.56, 0.57, 0.56, 0.58, 0.57, 0.56, 0.56, 0.57, 0.55 3500: 0.75, 0.74, 0.73, 0.74, 0.74, 0.73, 0.72, 0.75, 0.75, 0.75 As such, on average the runtime improvements are: N Avg Pre Avg Post Improvement (Pre / Post) 500 0.145 0.02 7.25 1000 0.326 0.073 4.47 1500 0.869 0.156 5.57 2000 1.358 0.258 5.26 2500 2.272 0.412 5.51 3000 3.034 0.566 5.36 3500 4.014 0.74 5.42 The test driver is as below, where both it and its dependencies were compiled with 'CFLAGS=-O2 -fopenmp' and linked with 'LDFLAGS=-fopenmp', 'LDLIBS=-lm': $ cat test.c #include "config.h" #include <stdio.h> #include <stdlib.h> #include <string.h> #include "ccan/strgrp/strgrp.h" int main(void) { FILE *f; char *buf; struct strgrp *ctx; f = fdopen(0, "r"); #define BUF_SIZE 512 buf = malloc(BUF_SIZE); ctx = strgrp_new(0.85); while(fgets(buf, BUF_SIZE, f)) { buf[strcspn(buf, "\r\n")] = '\0'; if (!strgrp_add(ctx, buf, NULL)) { printf("Failed to classify %s\n", buf); } } strgrp_free(ctx); free(buf); fclose(f); return 0; } [1] https://en.wikipedia.org/wiki/Cosine_similarity [2] https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
Showing
Please register or sign in to comment