History: For several years I've been occasionally posting newsgroup articles mentioning what I call "ProxHash". For example: 1998 2002 2003 2004 2005 2007
Now (2007.Oct.23) I've started a more formal description of some aspects of my methodology.
Introduction: Suppose we have a large (potentially infinite) space of otherwise-unspecified items we call "documents". For example we might have strings of text (US-ASCII for my test purposes, but Unicode more generally). For this space we then define a finite but large number of "features". Next we define a way to measure, for each combination of document and feature, how strongly that feature is represented in that document, i.e. the quantity or number of times that feature is present in that document. (The result is always a non-negative number, zero if the feature isn't present in that document.) For example, our features might be trigrams of characters, and we simply tabulate how many copies of a particular trigram appear within a particular document. Mapping across all possible trigrams, each single document yields an immense vector, one element per possible trigram, with the vast majority of the elements zero and just a few nonzero (hence positive).
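To make the feature-counting step concrete, here is a rough Python sketch (purely illustrative, not my actual code) of tabulating trigram counts for one document:

  # Count character trigrams in one document.  A dict stands in for the
  # immense mostly-zero vector indexed by every possible trigram.
  from collections import Counter

  def trigram_features(text):
      """Map each trigram to the number of times it occurs in the text."""
      return Counter(text[i:i+3] for i in range(len(text) - 2))

  # trigram_features("banana") -> {'ana': 2, 'ban': 1, 'nan': 1}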
Now if we directly measure the Euclidean distance between two of these vectors, we have a problem: All small documents tend to have values near zero, and they tend to cluster together, despite being on totally different topics. Contrariwise, large documents tend to have values far from zero, and far from each other, because the absolute number of differences is large even if the documents are in fact very similar. For example, two documents with two features each, none of the features in common, might have vectors of <0,0,1,1> and <1,1,0,0>, which have a Euclidean distance of sqrt(4), which is exactly 2. But two documents with a hundred features each, with 90% of them shared and only 10% different, would have a Euclidean distance of sqrt(20), which is approx. 4.47, a much greater distance, not what we'd like. One simple fix for this problem is to disallow empty documents (those which have no features), and for all the rest normalize the vector of feature strengths so that it has a norm of 1. This requires using floating point or other non-integer representations, but that's not an excessive cost on modern computers. With this done, in effect projecting our space onto (one hyper-quadrant of) a hyper-sphere of one less dimension than the original space, our measure of Euclidean distance is essentially the same as the naive concept of fractional dissimilarity, the opposite of fractional similarity. But unlike direct measures of similarity, we don't have paradoxes about whether a substring matches a superstring 100% or less, depending on whether we are counting what part of the substring is in common (100%) or what part of the superstring is in common (less than 100%).
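Here is an illustrative sketch of that normalization and of Euclidean distance on sparse vectors. Redoing the two examples above on normalized vectors gives sqrt(2) (approx. 1.414) for the disjoint pair versus sqrt(0.2) (approx. 0.447) for the 90%-shared pair, so the mostly-similar large documents now come out much closer than the totally-dissimilar small ones, as desired:

  import math

  def normalize(features):
      """Scale a nonempty feature-strength mapping to Euclidean norm 1."""
      norm = math.sqrt(sum(v * v for v in features.values()))
      return {k: v / norm for k, v in features.items()}

  def distance(a, b):
      """Euclidean distance between two sparse feature mappings."""
      keys = set(a) | set(b)
      return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

  # Two-feature documents with nothing in common:
  #   distance(normalize({'aa':1,'bb':1}), normalize({'cc':1,'dd':1}))
  #   = sqrt(2), approx. 1.414
  # Hundred-feature documents sharing 90 features:
  #   distance = sqrt(0.2), approx. 0.447 -- now much closer than the disjoint pair.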
The next step is to actually compute the ProxHash on each normalized vector, reducing them all to vectors of the same fixed length. This is done by applying a linear transformation to the normalized vector, where the coefficients are pseudo-random but fixed. (Generally I use a transformation to a 15-component vector.) Until I can get a patent on my method, the details will remain a trade secret, sorry. You'll have to just accept that I have a simple way to do this task, and that my software for it actually works, and that it isn't horribly expensive compute-wise.
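The actual ProxHash transformation stays a trade secret, but for readers who want something concrete to experiment with, any fixed pseudo-random linear projection has the same overall shape (huge sparse normalized vector in, 15 numbers out). The following is only a generic stand-in chosen for illustration, deriving a coefficient per (feature, axis) on the fly so the enormous matrix never has to be stored; it is not the real ProxHash:

  import hashlib

  def random_projection(vec, n_components=15, seed=0):
      """Generic stand-in: a fixed pseudo-random linear map from the sparse
      feature space down to n_components numbers.  NOT the real ProxHash."""
      out = [0.0] * n_components
      for feat, strength in vec.items():
          for axis in range(n_components):
              # Deterministic pseudo-random coefficient in roughly [-1, 1).
              h = hashlib.md5(f"{seed}:{axis}:{feat}".encode()).digest()
              coeff = int.from_bytes(h[:8], "big") / 2.0**63 - 1.0
              out[axis] += coeff * strength
      return out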
Putting all three steps together:
document -> featStreng vec -> norm vec -> ProxHash vector
we have a function from document to ProxHash vector in Euclidean 15 space.
Note we could reverse the last two steps like this:
document -> featStreng vec -> ProxHash vec -> norm PHvec
That is, we apply the linear transformation first, then normalize the resultant 15-vector. This assures that our final result is on the unit hyper-sphere. I'm not sure which way is best. Does anybody expert in vector-method information retrieval have an opinion based on experience trying something similar both ways?
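In terms of the illustrative sketches above (reusing normalize and random_projection, again not my real code), the two orderings differ only in where the normalization happens:

  import math

  def normalize_then_project(feat_vec, n_components=15):
      """Ordering given first: normalize the feature vector, then project."""
      return random_projection(normalize(feat_vec), n_components)

  def project_then_normalize(feat_vec, n_components=15):
      """Reversed ordering: project the raw feature-strength vector, then
      normalize the resultant 15-vector onto the unit hyper-sphere."""
      ph = random_projection(feat_vec, n_components)
      norm = math.sqrt(sum(x * x for x in ph))
      return [x / norm for x in ph]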
Note: I have a technical-jargon question. I've used the word "feature" for those characteristics of documents that may or may not be present and if present may be present more than once. But in math these are usually called "characters". Unfortunately when processing textual documents the word "character" already means a unit of textual data, and I don't want the conflict between the two meanings here. But maybe there's yet another word other than "character" or "feature" that would be better for this document?
Using the ProxHash:
Basic idea is triangle relaxation: Given three nodes with links between all pairs, omit the longest link. Given three nodes with only two links, tentatively add the third to see whether it's shorter than one of the old links or not, keeping the two shortest links and discarding the longest in all cases. But keeping a link is only temporary if it becomes the longest link in some *other* triangle. This triangle-relaxation process can undergo a domino effect (fecundity equal to 1 for a while) or even a chain reaction (fecundity greater than 1 for a while).
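A minimal sketch of one relaxation step (the bookkeeping that re-checks other triangles containing a dropped edge, i.e. the domino/chain-reaction part, is left out):

  def relax_triangle(links, dist, a, b, c):
      """Ensure all three edges of triangle (a,b,c) are tentatively present,
      then discard the longest, keeping the two shortest.  `links` is a set
      of frozenset pairs; `dist` measures the length of an edge."""
      edges = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
      links.update(edges)                      # tentatively add missing edges
      longest = max(edges, key=lambda e: dist(*tuple(e)))
      links.discard(longest)                   # drop the longest of the three
      return longest                           # so callers can re-check its other triangles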
But given a large set of nodes, starting with *every* possible link (n*(n-1)/2 altogether) is prohibitively expensive. So how to start with a much smaller number of links which nevertheless is sufficient to find all/most non-redundant neighbor links?
My first idea was to generate trial links at random from among those n*(n-1)/2 possible links, and stop the loop when we seem to be reaching diminishing returns, when nearly all new links simply reduce to links we already had. But I never was comfortable with setting a threshold for such a heuristic, so I didn't program it.
My next idea was to forego getting a (nearly-)complete set of non-redundant links, and instead just get a spanning tree approximating minimum total length, using dynamic programming based on including successive coordinates to provide the backtracking element. My algorithm reads in records in sequence, and for each record after the first it uses dynamic programming with worst-case cutoffs to avoid exploring known-impossible branches. Each new node gets linked to exactly one old node, which is guaranteed to be the nearest old node to the new node. As a result, this algorithm is useful for matching queries against already-treed data without needing to build anything close to a complete graph of triangle-non-redundant edges.
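Stripped of the dynamic-programming cutoffs, the shape of that spanning-tree algorithm is just this (the brute-force scan here is only to show the structure of the result; the real version prunes the search as described above):

  def grow_spanning_tree(records, dist):
      """Read records in sequence; link each new record to the nearest record
      already seen, yielding an approximate minimum-length spanning tree."""
      tree_edges = []
      seen = []
      for rec in records:
          if seen:
              nearest = min(seen, key=lambda old: dist(old, rec))
              tree_edges.append((nearest, rec))
          seen.append(rec)
      return tree_edges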
My latest idea, first tried by hand today (Oct.23) with just two dimensions, is to independently sort the records separately according to each component of the ProxHash, and use all adjacent-record links per each component as trial edges for triangle relaxation. For the purpose of my manual experiment, I used the rank along each component rather than computing actual ProxHash values, relying on visual judgement most of the time and manually computing Euclidean distance per the rank in cases too close to call by eye. Also, I simply generated a random permutation for each axis, totally unrelated to any actual ProxHash function. The result was quite successful: in both cases the cross-linked grid of trial links along the two axes relaxed to a set of polygons (all 4- or 5-sided in one test; 4-, 5-, 6-, and 8-sided in the other), with some dead-end spurs around the edges. In the first example, only one non-redundant edge failed to be generated, and in the second example not even one non-redundant edge was missed. Next I need to use actual ProxHash data, write software to use this algorithm instead of doing it manually, and use more than two dimensions to see whether the good results continue.
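The per-component sorting step is simple enough to sketch directly (trial edges only; triangle relaxation then prunes them):

  def candidate_edges(vectors):
      """Sort the records independently along each ProxHash component and
      take every adjacent pair in each sorted order as a trial edge.
      `vectors` maps record-id -> list of ProxHash components."""
      edges = set()
      n_components = len(next(iter(vectors.values())))
      for axis in range(n_components):
          order = sorted(vectors, key=lambda rid: vectors[rid][axis])
          for a, b in zip(order, order[1:]):
              edges.add(frozenset((a, b)))
      return edges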
Now regardless of what algorithm is used to link nodes together into a graph, whether it's only nearest neighbor or a near-full triangle-relaxed graph, a fundamental question is how many dimensions (components of ProxHash) are optimum, enough to reduce the discrepancy between ProxHash and true feature-vector distance to near zero without unnecessarily including additional dimensions that aren't really needed. My intuition says 15 is plenty but 5 may be almost enough.
To decide that question, I plan to use the nearly-complete graph of triangle non-redundant links from the last algorithm above as a list of test cases for comparing true feature-vector distance against ProxHash distance (computed with only a very few components). For each such test set, I'll fit a linear function relating ProxHash distance to true distance and report both the offset from the origin and the residual variance. I'll start with the utterly silly case of just a single component, then repeat with 2 components, then 3, etc. Thus I'll see how many dimensions are needed to adequately reflect the true NormFeat distances.
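A sketch of that test rig (purely illustrative); per the question raised next, it regresses the k-component ProxHash distance on the true NormFeat distance:

  import math

  def regression_report(pairs, true_dist, proxhash, max_components=15):
      """For k = 1..max_components, regress the k-component ProxHash distance
      on the true NormFeat distance over the test pairs, reporting the offset
      from the origin (intercept) and the residual variance."""
      xs = [true_dist(a, b) for a, b in pairs]   # true NormFeat distances
      n = len(pairs)
      mx = sum(xs) / n
      for k in range(1, max_components + 1):
          ys = [math.dist(proxhash[a][:k], proxhash[b][:k]) for a, b in pairs]
          my = sum(ys) / n
          slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                   / sum((x - mx) ** 2 for x in xs))
          offset = my - slope * mx
          resid_var = sum((y - (offset + slope * x)) ** 2
                          for x, y in zip(xs, ys)) / n
          print(k, "components: offset", offset, "residual variance", resid_var)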
But which way is really best to regress: given ProxHash, show scatter in NormFeat, or given NormFeat, show scatter in ProxHash? Why did I decide on the latter?
The code for ProxHash was written several years ago, but it was a bunch of functions in a huge anti-spam program, not modularized into related groups of functions. Before I use that code again, I need to pull each module out to an individual file and make sure it still works. Then I need to write the test rig to generate the set of test records (probably I'll just read in the first N entries from the most-common-words list used for my preschool flashcard drill program) and the code for the new algorithms discussed above (separate per-component sorting to get starting links to triangle-relax, the actual triangle-relaxation algorithm, the linear regression, and reporting of the residual variances).
One variation on my idea: The previous version of my idea, starting with a single dimension then bumping up the dimensions one at a time while performing triangle relaxation, would have kept all the early stages of that process as backups. Then when new data enters, the new data gets inserted into the one-dimensional sequence by binary search, then mapped across from there into the two-dimensional graph, which is immediately relaxed as needed, then mapped to the three-dimensional graph likewise, etc. So at any time there's the latest version of the graphs for all the various numbers of dimensions, all linked together into a single multi-layer structure. Now that I've tested the main algorithm for the transition from 1-d to 2-d, and am tentatively confident in this general type of algorithm, I'll probably do likewise, except I'll retain all the per-single-dimension sequences, not just the very first, to automatically generate preliminary links to relax: not just two links per the first dimension, but 2*N total preliminary links for each new record added to the structure.
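A bare skeleton of that multi-layer idea (illustrative only; the per-layer triangle relaxation itself is omitted, and all names are made up):

  import bisect

  class MultiLayerIndex:
      """Skeleton: one sorted sequence per single dimension, plus one edge set
      per number-of-dimensions layer; relaxing each layer's graph is omitted."""
      def __init__(self, n_components=15):
          self.sorted_axes = [[] for _ in range(n_components)]    # per-dimension record sequences
          self.layer_edges = [set() for _ in range(n_components)]  # per-layer graphs (edge sets)
          self.vectors = {}                                        # record-id -> ProxHash vector

      def insert(self, rec_id, phvec):
          self.vectors[rec_id] = phvec
          trial_edges = set()
          # Insert into each single-dimension sequence by binary search and
          # collect the (up to) two neighbors along that axis: 2*N trial links.
          for axis, seq in enumerate(self.sorted_axes):
              keys = [self.vectors[r][axis] for r in seq]
              pos = bisect.bisect_left(keys, phvec[axis])
              for nbr in seq[max(pos - 1, 0):pos + 1]:
                  trial_edges.add(frozenset((rec_id, nbr)))
              seq.insert(pos, rec_id)
          # Hand the trial links to every layer; each layer would then
          # triangle-relax them against its existing edges (not shown here).
          for edges in self.layer_edges:
              edges.update(trial_edges)
          return trial_edges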
-----------
Copyright 2007 by Robert Elton Maas, all rights reserved