Matching similar strings

I have a list of conferences on different topics, e.g.

Conference on genomics and neurosciences

Advances in string theory and astrophysics 

Genomics and neuroscience: 20 years of research

Swiss Physics society meeting on string theory and astrophysics

...

They fall into different classes: for example, titles 1 and 3 belong together, as do titles 2 and 4. What is the right tool to group those titles?

asked Jun 25 '18 at 21:46

LazyCat


You can try this approach: datascience.stackexchange.com/a/35482/54395. It only checks semantic similarity, but might be enough as a start.
– BrunoGL
Jul 15 '18 at 9:53

nlp


1 Answer

I assume you have some training data with labels, i.e. data where the titles are already linked to a given class? That would make this a supervised learning problem (as opposed to an unsupervised one), and you could follow these steps:

Step 1: your input is words, so you will need a method to create numerical representations (vectors). For that you could look into algorithms such as Word2Vec, Doc2Vec or GloVe, or something simpler like TF-IDF. If you go for the first (Word2Vec), you might consider trying the spaCy library in Python. Here is a tutorial on Word2Vec using spaCy.
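As a minimal sketch of Step 1, assuming the TF-IDF option and scikit-learn's TfidfVectorizer (the answer does not prescribe a library for TF-IDF, so that choice is an assumption), the four example titles from the question can be turned into vectors like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The four example titles from the question.
titles = [
    "Conference on genomics and neurosciences",
    "Advances in string theory and astrophysics",
    "Genomics and neuroscience: 20 years of research",
    "Swiss Physics society meeting on string theory and astrophysics",
]

# stop_words="english" drops common words such as "on", "and", "in", "of".
# By default each row is L2-normalised, so dot products are cosine similarities.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(titles)  # sparse matrix, one row per title

print(X.shape)
print(sorted(vectorizer.vocabulary_))  # the terms that became columns
```

Note that "neuroscience" and "neurosciences" end up as two different columns here; stemming or lemmatisation (e.g. with spaCy) would be needed to merge them.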

Step 2: once you have a numerical representation for each of your titles, you need to group them somehow. You could do this a few ways. Perhaps the simplest would be a clustering algorithm, e.g. the DBSCAN algorithm in scikit-learn (here is a demo). Note that clustering is unsupervised, so it does not need labels.
You could try more complicated methods, such as Support Vector Machines or Neural Networks, which do need labelled training data, but it is probably best to start with a method that will get you to some results more quickly. If you go the supervised route, you are classifying titles, so be sure to frame your problem as a classification problem as opposed to a regression problem.
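A rough sketch of the clustering option, assuming TF-IDF vectors as input and scikit-learn's DBSCAN with cosine distance. The values eps=0.9 and min_samples=2 are assumptions picked for this four-title toy example, not recommended defaults; on real data they would need tuning:

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "Conference on genomics and neurosciences",
    "Advances in string theory and astrophysics",
    "Genomics and neuroscience: 20 years of research",
    "Swiss Physics society meeting on string theory and astrophysics",
]

X = TfidfVectorizer(stop_words="english").fit_transform(titles)

# Two titles are neighbours if their cosine distance is at most eps.
# Titles with no shared terms have distance 1.0, so eps must stay below that.
# min_samples=2 lets a pair of titles form a cluster; label -1 would mean noise.
clustering = DBSCAN(eps=0.9, min_samples=2, metric="cosine").fit(X)

for label, title in zip(clustering.labels_, titles):
    print(label, title)
```

The loose eps is needed because titles 1 and 3 share only the word "genomics" once stop words are removed; with longer texts or word-vector representations a tighter threshold would usually be possible.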

Step 3: assess your results and try changing a part of the loop above.

In the above, I assumed you are talking about the semantic meaning of the conference titles, and not similarity between literal word/letter combinations. The latter could of course be computed directly, without the use of a model that learns.
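For the literal (non-semantic) case, one possible sketch uses Python's standard-library difflib. The helper name literal_similarity is hypothetical, and SequenceMatcher is just one of several ways to score character-level overlap:

```python
from difflib import SequenceMatcher


def literal_similarity(a: str, b: str) -> float:
    """Return a score in [0, 1] based on matching character runs, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


titles = [
    "Conference on genomics and neurosciences",
    "Advances in string theory and astrophysics",
    "Genomics and neuroscience: 20 years of research",
    "Swiss Physics society meeting on string theory and astrophysics",
]

# Score every pair of titles; no model is trained at any point.
for i in range(len(titles)):
    for j in range(i + 1, len(titles)):
        score = literal_similarity(titles[i], titles[j])
        print(f"{i + 1} vs {j + 1}: {score:.2f}")
```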

In response to OP's comment:
From my experience, using TF-IDF or something called minimal new sets might be a good way to get your titles into representations that allow clustering. Once clusters are formed, it would be up to you to interpret them and assign labels. If you know that there are e.g. only 10 conference topics, it shouldn't be too difficult to reach results. Have a look at this master's thesis that does a similar thing: instead of conferences, they want to detect topics. Disclaimer: I supervised that thesis.
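When the number of topics is known in advance, a sketch along these lines may help with the interpretation step. It assumes scikit-learn's KMeans on TF-IDF vectors (the answer does not name a specific clustering algorithm for this case) and lists the heaviest terms of each cluster centre as a hint for assigning a label by hand:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "Conference on genomics and neurosciences",
    "Advances in string theory and astrophysics",
    "Genomics and neuroscience: 20 years of research",
    "Swiss Physics society meeting on string theory and astrophysics",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(titles)

# n_clusters is assumed known: 2 for this toy list, e.g. 10 in a real case.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Column index -> term, rebuilt from the fitted vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# The three heaviest terms of each centre hint at what the cluster is about.
top_terms = {}
for cluster_id, centre in enumerate(km.cluster_centers_):
    best = centre.argsort()[::-1][:3]
    top_terms[cluster_id] = [terms[i] for i in best]

for title, label in zip(titles, km.labels_):
    print(int(label), top_terms[int(label)], "|", title)
```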

edited Jun 26 '18 at 14:34

answered Jun 25 '18 at 22:46

n1k31t4


Thank you, I am going over your suggestions. The input is as listed, so no labels. A rather naive question: I can, for example, just try to match words in conference titles: the more words match, the closer the titles are. I would then set a threshold, e.g. if more than 3 words match, declare them conferences on the same topic. There are a number of garbage words like "Conference", "Advances", "Workshop", "Society" and such, which I'll have to ignore. On a heuristic level, what would the relatively advanced tools that you've mentioned give me over this approach?
– LazyCat
Jun 26 '18 at 13:59
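The word-matching heuristic from this comment could be sketched as follows. The GARBAGE set extends the words listed in the comment with a few function words and "meeting", all of which are assumptions, and the helper names are hypothetical:

```python
import re

# Words to ignore: those named in the comment plus a few assumed extras.
GARBAGE = {
    "conference", "advances", "workshop", "society", "meeting",
    "on", "in", "and", "of", "the",
}


def content_words(title):
    """Lower-case alphabetic words of a title, minus the garbage words."""
    return {w for w in re.findall(r"[a-z]+", title.lower()) if w not in GARBAGE}


def shared_words(a, b):
    """Number of content words two titles have in common."""
    return len(content_words(a) & content_words(b))


titles = [
    "Conference on genomics and neurosciences",
    "Advances in string theory and astrophysics",
    "Genomics and neuroscience: 20 years of research",
    "Swiss Physics society meeting on string theory and astrophysics",
]

print(shared_words(titles[0], titles[2]))  # only "genomics" matches
print(shared_words(titles[1], titles[3]))  # "string", "theory", "astrophysics"
```

On the example titles a threshold of more than 3 shared words would group nothing: titles 1 and 3 share a single word because "neuroscience" and "neurosciences" do not match exactly, and titles 2 and 4 share exactly 3. Weighting schemes such as TF-IDF, or stemming, are aimed at this kind of brittleness.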

@LazyCat - see my edit.
– n1k31t4
Jun 26 '18 at 14:31

@LazyCat you can usually turn unsupervised data into supervised. Is there a reason why you can't take a sample of your data and label it and proceed accordingly? That will give you the results you seek in an efficient, algorithmic approach.
– I_Play_With_Data
Sep 24 '18 at 21:43

@UnknownCoder It is easy to turn supervised into unsupervised, but this is the first time I hear about the other way around.
– LazyCat
Sep 25 '18 at 0:15

@LazyCat Sure, it happens all the time and is a perfectly acceptable practice. In fact, you can "bootstrap" your way into a model. Label a few dozen records, train a model and then use the model to run predictions. Now sit there and check the predictions and use the correct predictions and add them to your training set. Re-train the model with the newly expanded training set, use the model to run predictions, etc, etc. Before you know it you will have a training set with a pretty good number of labeled examples.
– I_Play_With_Data
Sep 26 '18 at 18:51
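The bootstrapping loop described in this comment might look roughly like the sketch below. The class names "bio" and "physics", the ANSWER_KEY that stands in for a person checking predictions, and the choice of TF-IDF plus logistic regression are all assumptions made for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical seed set, labelled by hand.
labelled = {
    "Conference on genomics and neurosciences": "bio",
    "Advances in string theory and astrophysics": "physics",
}
unlabelled = [
    "Genomics and neuroscience: 20 years of research",
    "Swiss Physics society meeting on string theory and astrophysics",
]

# Stand-in for manual review: in practice a person checks each prediction.
ANSWER_KEY = {unlabelled[0]: "bio", unlabelled[1]: "physics"}


def human_confirms(title, predicted):
    return ANSWER_KEY[title] == predicted


for _ in range(3):  # a few label / train / predict / check rounds
    if not unlabelled:
        break
    model = make_pipeline(TfidfVectorizer(stop_words="english"),
                          LogisticRegression())
    model.fit(list(labelled), list(labelled.values()))

    still_unlabelled = []
    for title, predicted in zip(unlabelled, model.predict(unlabelled)):
        if human_confirms(title, str(predicted)):
            labelled[title] = str(predicted)  # grow the training set
        else:
            still_unlabelled.append(title)    # left for a later round
    unlabelled = still_unlabelled

print(labelled)
```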

