Alternatives to TF-IDF and Cosine Similarity when comparing documents of differing formats


























I've been working on a small personal project that takes a user's job skills and suggests the most suitable career for them based on those skills. I use a database of job listings to achieve this. At the moment, the code works as follows:



1) Process the text of each job listing to extract skills that are mentioned in the listing



2) For each career (e.g. "Data Analyst"), combine the processed text of the job listings for that career into one document



3) Calculate the TF-IDF of each skill within the career documents



After this, I'm not sure which method I should use to rank careers based on a list of a user's skills. The most popular method that I've seen would be to treat the user's skills as a document as well, then to calculate the TF-IDF for the skill document, and use something like cosine similarity to calculate the similarity between the skill document and each career document.



This doesn't seem like the ideal solution to me, since cosine similarity is best used when comparing two documents of the same format. For that matter, TF-IDF doesn't seem like the appropriate metric to apply to the user's skill list at all. For instance, if a user adds additional skills to their list, the TF of each existing skill will drop. In reality, I don't care what the frequency of each skill is in the user's skill list -- I just care that they have those skills (and maybe how well they know them).



It seems like a better approach would be the following:



1) For each skill that the user has, calculate the TF-IDF of that skill in the career documents



2) For each career, sum the TF-IDF results for all of the user's skills



3) Rank careers based on the above sum
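The proposed scoring can be sketched as follows (a minimal sketch with made-up skill names and hand-set TF-IDF weights purely for illustration):

```python
# Hypothetical TF-IDF weights of each skill within each career document.
tfidf = {
    "Data Analyst": {"python": 0.40, "sql": 0.35, "excel": 0.10},
    "Accountant":   {"excel": 0.50, "tax law": 0.45},
}

def rank_careers(user_skills, tfidf):
    """Score each career by summing the TF-IDF of the user's skills."""
    scores = {
        career: sum(weights.get(skill, 0.0) for skill in user_skills)
        for career, weights in tfidf.items()
    }
    # Highest score first.
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranking = rank_careers({"python", "sql"}, tfidf)
# ranking[0] is ("Data Analyst", 0.75)
```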



Am I thinking along the right lines here? If so, are there any algorithms that work along these lines, but are more sophisticated than a simple sum? Thanks for the help!



























  • Check out Doc2vec; Gensim has an implementation. – Blue482 (Jan 3 '17 at 11:55)










  • See datascience.stackexchange.com/questions/5121/… – Intruso (Mar 4 '17 at 14:29)
















nlp text-mining similarity cosine-distance
















asked Jan 2 '17 at 20:41 by Richard Knoche








4 Answers































Perhaps you could use word embeddings to better represent the distance between certain skills. For instance, "Python" and "R" should be closer together than "Python" and "Time management", since the first two are both programming languages.



The whole idea is that words that appear in the same context should be closer.



Once you have these embeddings, you would have a set of skills for the candidate, and sets of skills of various sizes for the jobs. You could then use Earth Mover's Distance to calculate the distance between the sets. This distance measure is rather slow, so it might not scale well if you have many jobs to go through.



To deal with the scalability issue, you could perhaps rank the jobs based on how many skills the candidate has in common in the first place, and favor these jobs.
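A minimal sketch of the idea: with uniform weights and equal-size sets, the Earth Mover's Distance reduces to finding the cheapest one-to-one matching, which a tiny example can brute-force. The 2-D "embeddings" below are made up for illustration; a real system would use trained word vectors and a proper EMD solver.

```python
from itertools import permutations
from math import dist  # Euclidean distance, Python 3.8+

def emd_uniform(xs, ys):
    """Earth Mover's Distance between two equal-size point sets with
    uniform weights: the cheapest one-to-one matching, averaged.
    Brute force over assignments -- only viable for tiny sets."""
    assert len(xs) == len(ys)
    n = len(xs)
    return min(
        sum(dist(x, y) for x, y in zip(xs, perm)) / n
        for perm in permutations(ys)
    )

# Made-up 2-D "skill embeddings" for a candidate and a job.
candidate = [(0.0, 0.0), (1.0, 0.0)]
job       = [(1.0, 0.0), (0.0, 1.0)]
d = emd_uniform(candidate, job)  # 0.5: one skill matches exactly
```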






answered Jan 30 '18 at 22:26 by Valentin Calomme


































    A common and simple method to match "documents" is to use TF-IDF weighting, as you have described. However, as I understand your question, you want to rank each career (-document) based on a set of user skills.



    If you create a "query vector" from the skills, you can multiply the vector with your term-career matrix (with all the tf-idf weights as values). The resulting vector would give you a ranking score per career-document which you can use to pick the top-k careers for the set of "query skills".



    E.g. if your query vector $\bar{q}$ consists of zeros and ones and is of size $1 \times |\mathrm{terms}|$, and your term-document matrix $M$ is of size $|\mathrm{terms}| \times |\mathrm{documents}|$, then $\bar{q} M$ results in a vector of size $1 \times |\mathrm{documents}|$ with elements equal to the sum of every query term's TF-IDF weight per career document.



    This method of ranking is one of the simplest and many variations exist. The TF-IDF entry on Wikipedia also describes this ranking method briefly. I also found this Q&A on SO about matching documents.
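    A sketch of this ranking with a hand-set term–career matrix (the weights, terms, and career names are illustrative, not real data):

```python
import numpy as np

terms   = ["python", "excel", "sql"]
careers = ["Data Analyst", "Accountant"]

# Hypothetical TF-IDF weights: rows = terms, columns = career documents.
M = np.array([
    [0.5, 0.0],   # python
    [0.1, 0.6],   # excel
    [0.4, 0.3],   # sql
])

# Binary query vector: the user knows python and sql.
q = np.array([1, 0, 1])

scores = q @ M                 # one ranking score per career document
ranking = np.argsort(-scores)  # career indices, best first
```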






    answered Mar 13 '18 at 19:14 by KorkiBuziek (edited Mar 13 '18 at 19:42 by Stephen Rauch)













    • Surprisingly, a simple average of word embeddings is often as good as a weighted average of embeddings done with Tf-Idf weights. – wacax (Mar 26 '18 at 21:38)
































    Use the Jaccard index: treat the user's skills and each career's extracted skills as sets, and score each career by the size of the intersection divided by the size of the union. This will serve your purpose well.
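    A minimal sketch (the skill sets and career names are made up for illustration):

```python
def jaccard(a, b):
    """Jaccard index of two sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

user    = {"python", "sql", "excel"}
analyst = {"python", "sql", "statistics"}
clerk   = {"excel", "filing"}

# The career whose skill set overlaps the user's the most ranks first.
best = max([("Data Analyst", analyst), ("Clerk", clerk)],
           key=lambda kv: jaccard(user, kv[1]))
```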






    answered Jan 3 '17 at 10:05 by Himanshu Rai (edited Jul 4 '18 at 15:42 by visitor)


































      You can try using gensim. I did a similar project with unstructured data; gensim gave better scores than standard TF-IDF, and it also ran faster.






      answered by Harsha Reddy (new contributor)












