Project Fuzzy Match

First select the data field or fields that will form the basis for matching between two datasets, typically titles and/or bibliographic data: journal titles, ISSN, article volume and first page number.

Each individual word in the selected match fields is split out together with a key from the dataset in question, typically a record number - either self-chosen or the record's ID number from the original database. Use the program FuzzyMatch prepare in Bibliometry Toolbox. The program produces a list of the form

word1 <TAB> postID
word2 <TAB> postID

Each "word <TAB> postID" list is sorted alphabetically by word, and duplicates are merged (Merge by Label in Bibliometry Toolbox).
The result is a list with each word from the selected data field plus every postID in which that word occurs. The list is loaded into a database.

COMPENSATE : 2-s2.0-0030715808 | 2-s2.0-0033153983 | 2-s2.0-84882836027
COMPENSATORS: 2-s2.0-2942521062

COMPENSATE : WOS:000081321700017 | WOS:000323567300013 | WOS:A1997BJ13Z00083
COMPENSATORS: WOS:000221387600026
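As an illustration, the prepare step could look like this in Python (a sketch only; the `records` dict format and the tokenization are assumptions, not the actual FuzzyMatch prepare program):

```python
import re
from collections import defaultdict

def build_word_index(records):
    """Map each word in a record's match field to the postIDs containing it.

    `records` is assumed to be {postID: match_field_text}.
    """
    index = defaultdict(set)
    for post_id, text in records.items():
        # Strip non-alphanumeric characters, as the toolbox programs do
        for word in re.sub(r"[^A-Za-z0-9 ]+", " ", text.upper()).split():
            index[word].add(post_id)
    return index

scopus = {"2-s2.0-0030715808": "Compensate for losses",
          "2-s2.0-2942521062": "Active compensators"}
index = build_word_index(scopus)
print(sorted(index["COMPENSATE"]))  # ['2-s2.0-0030715808']
```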

To match two different datasets, the two bases are now matched "word" against "word", and the postIDs from one base are thereby merged into the other dataset's word + postID base. The result is a base with each word from one dataset + postID1 + postID2, e.g.:

COMPENSATE : 2-s2.0-0030715808 | 2-s2.0-0033153983 | 2-s2.0-84882836027 / WOS:000081321700017 | WOS:000323567300013 | WOS:A1997BJ13Z00083
COMPENSATORS: 2-s2.0-2942521062 / WOS:000221387600026
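The word-on-word merge could be sketched as follows (a hypothetical Python illustration of the merge step, not the toolbox's actual implementation):

```python
def merge_on_word(index1, index2):
    """For each word found in both bases, pair the two postID sets.

    Words occurring in only one base yield no candidate pairs and are dropped.
    """
    return {word: (index1[word], index2[word])
            for word in index1.keys() & index2.keys()}

scopus_index = {"COMPENSATE": {"2-s2.0-0030715808"},
                "LOSSES": {"2-s2.0-0030715808"}}
wos_index = {"COMPENSATE": {"WOS:000081321700017"}}
merged = merge_on_word(scopus_index, wos_index)
print(merged)  # {'COMPENSATE': ({'2-s2.0-0030715808'}, {'WOS:000081321700017'})}
```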

Rare words will often map records directly, but normally you must work with more common words that map many records from one base to many records from the other. Records with the largest number of shared words can be singled out in a simple way on the basis of postIDs alone: each pair of postIDs is isolated with the program FuzzyMatch split (again in Bibliometry Toolbox)

2-s2.0-0030715808 | 2-s2.0-0033153983 | 2-s2.0-84882836027 <TAB> WOS:000081321700017 | WOS:000323567300013 | WOS:A1997BJ13Z00083

gives:

2-s2.0-0030715808 <TAB> WOS:000081321700017
2-s2.0-0033153983 <TAB> WOS:000081321700017
2-s2.0-84882836027 <TAB> WOS:000081321700017
2-s2.0-0030715808 <TAB> WOS:000323567300013
2-s2.0-0033153983 <TAB> WOS:000323567300013
2-s2.0-84882836027 <TAB> WOS:000323567300013
2-s2.0-0030715808 <TAB> WOS:A1997BJ13Z00083
2-s2.0-0033153983 <TAB> WOS:A1997BJ13Z00083
2-s2.0-84882836027 <TAB> WOS:A1997BJ13Z00083
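FuzzyMatch split's expansion of one merged line into all postID pairs amounts to a Cartesian product; a minimal Python sketch:

```python
from itertools import product

def split_pairs(ids1, ids2):
    """Expand two postID lists into every (postID1, postID2) combination,
    one pair per output line, as in the example above."""
    return list(product(ids1, ids2))

pairs = split_pairs(["2-s2.0-0030715808", "2-s2.0-0033153983"],
                    ["WOS:000081321700017"])
print(len(pairs))  # 2
```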

For the whole dataset, every postID in one dataset is now mapped to every postID in the other dataset. Sort alphabetically by one set of postIDs and count the occurrences of the other dataset's postIDs (use the program Count duplicated lines), sort by number of duplicates for the selected set of postIDs, and then use FuzzyMatch final in Bibliometry Toolbox: the postID1 with the most postID2 hits is the most likely match between the two datasets.
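The counting-and-selection step (Count duplicated lines followed by FuzzyMatch final) can be sketched with a `Counter`; the function name and data are illustrative, not the toolbox's own code:

```python
from collections import Counter

def best_matches(pairs):
    """Count duplicated (postID1, postID2) pairs; for each postID1, the
    postID2 with the most shared words is the most likely match."""
    counts = Counter(pairs)
    best = {}
    for (id1, id2), n in counts.items():
        if id1 not in best or n > best[id1][1]:
            best[id1] = (id2, n)
    return best

pairs = [("S1", "W1"), ("S1", "W1"), ("S1", "W2"), ("S2", "W2")]
print(best_matches(pairs))  # {'S1': ('W1', 2), 'S2': ('W2', 1)}
```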

On the basis of the postID1 and postID2 list, the bibliographic information of the matched records can now be compared and any mismatches easily detected.

Cleaning text for matching

The method above works well even with many common words in the matched texts. All non-alphanumeric characters are removed by the programs, but it can be practical to manually edit out the most common (and insignificant) words with this regex:

A|AN|AND|AS|AT|BY|FOR|FROM|IN|IS|OF|ON|OR|THE|TO|WITH are removed with:

\b[ \t]A\b |\b[ \t]AN\b | \bAND\b | \bAS\b | \bAT\b | \bBY\b | \bFOR\b | \bFROM\b | \bIN\b | \bIS\b | \bOF\b | \bON\b | \bOR\b |\b[ \t]THE\b | \bTO\b | \bWITH\b replace with "space"
'|-|–|—|"|&|\(|\)|/|:|;|\?|\[|\]|¨|“|”|€|\+|<|>|·|³|œ|™ replace with "space"
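In Python, the same clean-up could be done roughly like this (a sketch; the stopword and symbol sets are taken from the two patterns above):

```python
import re

STOPWORDS = r"\b(A|AN|AND|AS|AT|BY|FOR|FROM|IN|IS|OF|ON|OR|THE|TO|WITH)\b"
SYMBOLS = r"['\-–—\"&()/:;?\[\]¨“”€+<>·³œ™]"

def clean(text):
    """Replace common stopwords and symbols with spaces,
    then collapse runs of whitespace."""
    text = re.sub(STOPWORDS, " ", text.upper())
    text = re.sub(SYMBOLS, " ", text)
    return " ".join(text.split())

print(clean("The Theory of Fuzzy-Matching"))  # THEORY FUZZY MATCHING
```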

Other methods - not tested

http://rosettacode.org/wiki/Levenshtein_distance
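For reference, the linked Levenshtein distance is the classic dynamic-programming edit distance; a minimal Python version (untested against the datasets in this project) might look like:

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```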

EXCEL-add-in: http://precisioncalc.com/it/tutorial.htm

VBA: The algorithm in its current form computes the frequency of common characters between the two input strings and also the frequency of identical tuples (two-character sequences), weights them and builds a normalized score in the range of [0…1].
https://code.google.com/p/fast-vba-fuzzy-scoring-algorithm/source/browse/trunk/Fuzzy1
https://code.google.com/p/fast-vba-fuzzy-scoring-algorithm/source/browse/trunk/Fuzzy2


Q-gram distance
If you have one pattern and want to find the best match against a text collection, you can try q-gram distance. It is quite easy to implement and adapt to special needs.

Try it like this:
transform your texts and patterns into a reduced character set, like uppercase-only, stripped, wordified (one space between words), all symbols replaced by "#" or something.
choose a q-gram length to work with. Try 3 or 2. We call this q=3.
then build a q-gram profile of each text:
split each text into q-words, i.e. NEW_YORK becomes [NEW, EW_, W_Y, _YO, YOR, ORK], and store this away with each text.
if you then search for your pattern, you do the same with your pattern:
loop through your text-qgram database and
count for each pattern/text pair how many q-grams are the same.
Each hit raises the score by 1.
The texts with the highest score(s) are your best hits.
If you did that you can tweak this algorithm by:

pad all your texts (and also the pattern, before searching) with q-1 special chars at each end, so even short words get a decent profile. For example, NEW YORK becomes ^^NEW YORK$$.
You can even play around with replacing all consonants with "x" and vowels with "o" and so on. Play around with a couple of character classes this way, or even create super-symbols by replacing groups of characters with a single one, e.g. CK becomes K, or SCH becomes $.
when raising the score on a q-gram hit, you can adjust the value of 1 by other factors, like the length difference between text and pattern.
store both 2-grams and 3-grams, and when counting, weigh them differently.
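The q-gram recipe above can be sketched in a few lines of Python (an illustration only, with q=3 and "^"/"$" padding as suggested):

```python
def qgram_profile(text, q=3):
    """Pad the text with q-1 boundary markers at each end and
    collect its set of q-grams."""
    padded = "^" * (q - 1) + text.upper() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_score(pattern, text, q=3):
    """Count q-grams shared between pattern and text; higher is better."""
    return len(qgram_profile(pattern, q) & qgram_profile(text, q))

texts = ["NEW YORK", "NEWARK", "LOS ANGELES"]
best = max(texts, key=lambda t: qgram_score("NEW YORK CITY", t))
print(best)  # NEW YORK
```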


Unless otherwise stated, the content of this page is licensed under the Creative Commons Attribution-ShareAlike 3.0 License