NLP: Fuzzy Word/Phrase Match
$begingroup$
I am attempting to determine if a given phrase (or a few words) is present in a relatively large text. For example:
Text:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce sed
tristique purus, id lobortis justo. Vestibulum ante ipsum primis in
faucibus orci luctus et ultrices posuere cubilia Curae; Cras vitae
neque non nibh elementum malesuada convallis et nunc. Nam vel tellus
nec nunc dictum dignissim eu ut felis. In Tony Starkeget efficitur
nunc. Cras ultrices turpis est, ac eleifend leo congue at. Donec lorem
diam, mattis sed sollicitudin ac, tincidunt eu sem. Curabitur vel
euismod lectus, sit amet tempor massa. Vivamus ut dictum nisl. Aliquam
et urna sit amet urna hendrerit tincidunt in a mauris. Class aptent
taciti sociosqu ad litora torquent per conubia nostra, per inceptos
himenaeos. Maecenas vel justo metus. Sed gravida egestas velit,
porttitor pulvinar justo hendrerit et.
Phrase/Words to match in the text above:
tony.stark
t.stark
stark_tony
starktony
The goal is to infer whether the person (Tony Stark) is mentioned in a block of text.
I have read up on fuzzy word matching algorithms such as Levenshtein distance and Soundex, and tested them on this problem, but they appear to be suited to matching one word against one word. They do not fit this case, where various permutations of "Tony Stark" are possible in both the pattern and the text.
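The limitation can be sketched as follows. This is a minimal illustration, not the exact code I tested: `levenshtein` and `best_token_distance` are illustrative names of my own, and `snippet` is a fragment of the text above. Each pattern is compared against every whitespace-separated token of the text, which is the word-to-word usage described above.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def best_token_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between the pattern and any single token of the text."""
    tokens = text.lower().split()
    return min(levenshtein(pattern.lower(), tok) for tok in tokens)


# Fragment of the sample text; note the name runs into the next word.
snippet = "ut felis. In Tony Starkeget efficitur nunc."

for pattern in ["tony.stark", "t.stark", "stark_tony", "starktony"]:
    print(pattern, best_token_distance(pattern, snippet))
```

Because each pattern combines both names (sometimes reordered or abbreviated) while each token holds at most one of them, no single token is an exact match for any pattern, and a per-token distance cannot account for reordering.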
Could anyone advise which fuzzy matching algorithms would suit this application, and perhaps share resources and sample code for implementing them?
Thanks.
nlp
$endgroup$
asked 8 mins ago
KohKoh