# Document Analysis using Support Vector Machines

This post details the Vector Space Kernel Model for document analysis outlined in Shawe-Taylor and Cristianini.

## Create Encoded Matrix for each Document #

- Select Dictionary of Terms
- Calculate Term Frequency
- Encode using Dictionary of Terms

### Calculate Document-Term Matrix #

The document-term matrix records the frequency of each term across the collection of documents: each row corresponds to a document and each column to a term in the dictionary.
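
As a concrete illustration, the sketch below builds the dictionary and the document-term matrix in Python for a toy three-document corpus; the corpus and the whitespace tokenization are assumptions for illustration only.

```python
import numpy as np

# Toy corpus; in practice the documents would come from a real collection.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog",
]

# Select the dictionary of terms: here, the sorted set of all words seen.
dictionary = sorted({word for doc in corpus for word in doc.split()})
index = {term: j for j, term in enumerate(dictionary)}

# Encode each document as a row of term frequencies, giving the
# document-term matrix D (one row per document, one column per term).
D = np.zeros((len(corpus), len(dictionary)))
for i, doc in enumerate(corpus):
    for word in doc.split():
        D[i, index[word]] += 1

print(dictionary)
print(D)
```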

## Kernel Methods for Support Vector Machines #

If **x** and **y** are vectors representing documents, then a kernel **K** is defined as:

```
K(x, y) = φ(x) · φ(y)
```

where the kernel **K** gives the dot product in the new feature space defined by the mapping **φ**. For suitable mappings, **K** can be evaluated directly as a function of the dot product in the original feature space, without computing **φ** explicitly.
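
As a sanity check of this idea, the sketch below uses the homogeneous quadratic kernel, a standard example in which the feature-space dot product equals the squared dot product of the original vectors; the feature map `phi` is illustrative and is not the document mapping used later in this post.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])

# Explicit feature map for the homogeneous quadratic kernel in 2D:
# phi(v) = (v1^2, sqrt(2)*v1*v2, v2^2).
def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

# Dot product computed explicitly in the feature space ...
explicit = phi(x) @ phi(y)

# ... equals a function of the dot product in the original space: (x . y)^2.
implicit = (x @ y) ** 2

print(explicit, implicit)  # both print 25.0
```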

## Document Analysis #

Given the *document-term matrix* (**D**), whose rows are the documents' term-frequency vectors, and its transpose, the *term-document matrix* (**D’**), define the term co-occurrence matrix **K = D’D**. Then for documents **d_1** and **d_2**, define the *Vector Space Kernel* as

```
VSK(d_1, d_2) = φ(d_1) D' D φ(d_2)'
```
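
A minimal NumPy sketch of this kernel, assuming `D` holds raw term counts with one row per document; the random matrix stands in for a real corpus.

```python
import numpy as np

# Stand-in document-term matrix (5 documents, 8 terms) with random counts.
rng = np.random.default_rng(0)
D = rng.integers(0, 3, size=(5, 8)).astype(float)

# Term co-occurrence matrix K = D'D (terms x terms).
K = D.T @ D

# Vector space kernel between two documents, where phi(d) is the
# row of term frequencies for document d.
def vsk(phi_1, phi_2):
    return phi_1 @ K @ phi_2

print(vsk(D[0], D[1]))
```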

## Relevance Matrix #

A relevancy matrix **R** assigns a weight to each term in the document, and a proximity matrix **P** encodes the semantic distance between terms. The relevancy matrix captures the distribution of the inverse document frequency. Although the proximity matrix is not hierarchical, a non-zero entry implies a co-occurrence of terms, which in turn implies a smaller semantic distance. To reduce the impact of document length on the semantic distance, the matrices require normalization. Together they compose the semantic matrix:

```
S = RP
```

The relevancy matrix **R** is defined in terms of the term weight **w**, given by

```
w(t) = ln( L / df(t) )
```

where **L** is the number of documents and **df(t)** is the document frequency for the term **t**. This ratio is the inverse of the document frequency across the entire corpus.
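
A short sketch of the weight computation, assuming, as is conventional though not spelled out above, that **R** is the diagonal matrix of these weights; terms that appear in no document are given zero weight here.

```python
import numpy as np

# Stand-in document-term matrix (5 documents, 8 terms).
rng = np.random.default_rng(0)
D = rng.integers(0, 3, size=(5, 8)).astype(float)

L = D.shape[0]             # number of documents
df = (D > 0).sum(axis=0)   # document frequency df(t) for each term t

# Term weights w(t) = ln(L / df(t)); unseen terms get zero weight.
w = np.where(df > 0, np.log(L / np.maximum(df, 1)), 0.0)

# Relevancy matrix R as a diagonal matrix of the weights.
R = np.diag(w)
print(np.round(w, 3))
```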

## Proximity Matrix #

The proximity matrix **P** is defined as the transpose of matrix **D**.

```
P = D'
```

With this choice of **P**, the resulting *Proximity Kernel* involves the matrix **D’D**, whose entries measure term co-occurrence. From this co-occurrence information, semantic relations between terms can be inferred.

Given a kernel mapping based on the term co-occurrence matrix **D’D**, singular value decomposition of **D’** yields the matrices **U**, **∑**, and **V**, where **∑** is the diagonal matrix of singular values, the columns of **U** are eigenvectors of **D’D**, and the columns of **V** identify the relevant topics within the corpus.

```
κ(d_1, d_2) = φ(d_1) D' D φ(d_2)'
D' = U ∑ V'
```
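
This relationship can be checked numerically; the sketch below again uses a random count matrix in place of a real corpus.

```python
import numpy as np

# Stand-in document-term matrix (5 documents, 8 terms).
rng = np.random.default_rng(0)
D = rng.integers(0, 3, size=(5, 8)).astype(float)

# SVD of the term-document matrix: D' = U S V'.
U, s, Vt = np.linalg.svd(D.T, full_matrices=False)

# Columns of U are eigenvectors of the term co-occurrence matrix D'D,
# with eigenvalues equal to the squared singular values.
print(np.allclose(U @ np.diag(s**2) @ U.T, D.T @ D))  # True
```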

## Semantic Analysis #

Several approaches exist for the representation of semantic information and the construction of semantic kernels.

## Semantic Matrix Composition #

Given a *document-term matrix* and a *term co-occurrence kernel* mapping (**D’D**), a semantic matrix can be composed as the product of a *relevancy matrix* and a *proximity matrix*, as in **S = RP** above.

## Implicit Semantic Mapping #

By taking the inner product of each document's term-frequency vector through this decomposition, a kernel is composed that implicitly encodes the semantic structure of the corpus.

```
PK(d_1, d_2) = φ(d_1) U ∑^2 U' φ(d_2)'
```
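
A sketch verifying that the kernel computed through the decomposition agrees with the proximity kernel computed through **D’D** directly, using the same stand-in matrix as before.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.integers(0, 3, size=(5, 8)).astype(float)

U, s, Vt = np.linalg.svd(D.T, full_matrices=False)

# PK(d_1, d_2) = phi(d_1) U S^2 U' phi(d_2)', evaluated through the SVD.
def pk(phi_1, phi_2):
    return phi_1 @ U @ np.diag(s**2) @ U.T @ phi_2

# Agrees with phi(d_1) D'D phi(d_2)' computed directly.
print(np.isclose(pk(D[0], D[1]), D[0] @ (D.T @ D) @ D[1]))  # True
```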

## Topic Discovery #

To discover topics within a vector space representation of a collection of documents, singular value decomposition is applied, yielding a matrix **V** whose columns are eigenvectors of the matrix **DD’**. Minimizing the number of topics retained while maximizing classification accuracy allows the relevant topics to be extracted.
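
A sketch of extracting the leading topic directions; here **k** is fixed by hand, whereas in practice it would be chosen by the accuracy trade-off just described.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.integers(0, 3, size=(5, 8)).astype(float)

# Matching the notation above, D' = U S V', so the columns of V
# are eigenvectors of DD'.
U, s, Vt = np.linalg.svd(D.T, full_matrices=False)
V = Vt.T
print(np.allclose(V @ np.diag(s**2) @ V.T, D @ D.T))  # True

# Keep the k leading directions as topics.
k = 2
doc_topics = V[:, :k]    # document loadings on the k topics
term_topics = U[:, :k]   # term loadings on the same topics
print(doc_topics.shape, term_topics.shape)
```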

## Explicit Semantic Mapping #

Semantic relations inferred from relevance and proximity can also be specified explicitly in a semantic or conceptual network. These structures carry an intrinsic metric of semantic distance, so the proximity of two terms can be measured as the distance between them within the network.
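
As a toy illustration of this idea, the sketch below measures semantic proximity as shortest-path length in a small hand-built concept network; the network is invented for the example and is not drawn from any real ontology.

```python
from collections import deque

# Toy concept network; edges link semantically related terms.
network = {
    "cat": ["feline", "pet"],
    "feline": ["cat", "animal"],
    "dog": ["canine", "pet"],
    "canine": ["dog", "animal"],
    "pet": ["cat", "dog"],
    "animal": ["feline", "canine"],
}

def semantic_distance(a, b):
    """Shortest-path length between two terms in the network (BFS)."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in network.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the terms are not connected

print(semantic_distance("cat", "dog"))  # 2, via "pet"
```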

J.M.

February 2020
