Tagging Unix/Non-Unix logs using NLP
















      I have a set of unstructured data consisting of command output logs from different operating systems such as Unix, Windows, etc.



      For example:




      Releasing version 0.0.1 for Stackoverflow, on 01/01/2019. The coverage is 99% and build is passed.




      (This is just an example and is not related to the actual use case.) The output differs between operating systems. I want to perform tagging on this data.



      For the case above, the output should be:




      Releasing version VERSION_NUMBER for PRODUCT_NAME, on releaseDate. The TEST_TYPE is TEST_VALUE and TEST_TYPE is TEST_VALUE




      Here, some words are replaced with their corresponding sample tags.



      I have studied techniques such as POS tagging, NER, and LSTMs, but I don't know which one suits this particular problem. How can I gather data from the raw output, and how should I apply those techniques here?



      Thanks to everyone who is willing to help me with this.







      machine-learning deep-learning nlp regex














      asked Dec 21 '18 at 18:39 by Arpit Kathuria
      edited Dec 21 '18 at 19:35 by wacax





























          1 Answer



          PoS tagging works only on natural language and identifies the grammatical parts of a sentence, nothing more. An LSTM is a recurrent neural network architecture that can be used to model and predict sequences. Named Entity Recognition (NER) and Terminology Extraction could work if you already have data to engage in Information Extraction (IE). However, to use these techniques you need a trained model, and to train one you need data. In your case that would involve identifying and tagging parts of the sentences by hand and then training a model on that data.
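To make "tagging by hand" concrete: NER training data is commonly stored as character-offset spans over the raw text and then converted to per-token labels. The sketch below is only an illustration of that idea. The label names are taken from the question; the `Span` type, the `to_bio` helper, and the whitespace tokenization are my own assumptions and do not come from any particular library.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Convert character spans to per-token BIO labels (whitespace tokens)."""
    out = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for span in spans:
            # A token gets the label if it overlaps the span at all.
            if start < span.end and end > span.start:
                prefix = "B" if start <= span.start else "I"
                tag = f"{prefix}-{span.label}"
                break
        out.append((token, tag))
    return out


text = "Releasing version 0.0.1 for Stackoverflow, on 01/01/2019."
# Hand-made annotations for this one line (hypothetical, labels from the question).
spans = [
    Span(18, 23, "VERSION_NUMBER"),
    Span(28, 41, "PRODUCT_NAME"),
    Span(46, 56, "releaseDate"),
]
labeled = to_bio(text, spans)
```

Every log format you want the model to handle needs a reasonable number of lines annotated this way, which is where most of the effort goes.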



          The best approach, in my opinion, is simply to use regex, which is itself one of the approaches to Information Extraction, to identify parts of the sentence, with hard-coded rules that pin down what you want to replace later.



          For instance, if you want to extract the version from the example:




          import re



          s = 'Releasing version 0.0.1 for Stackoverflow, on 01/01/2019. The coverage is 99% and build is passed.'



          re.search(r'Releasing version(.*?)for', s).group(1)



          ' 0.0.1 '
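The same idea extends from extracting one value to producing the tagged line from the question. What follows is a minimal sketch, assuming an ordered list of substitution rules; the `RULES` table and `tag_line` are hypothetical names, and the patterns are fitted only to the single example line, so real logs would need their own rules (the last "X is Y" rule in particular is deliberately crude).

```python
import re

# Ordered (pattern, replacement) rules; earlier rules run first.
# These patterns are illustrative guesses based on the one example line.
RULES = [
    (re.compile(r"(version )\d+(?:\.\d+)+"), r"\g<1>VERSION_NUMBER"),
    (re.compile(r"( for )\w+(,)"), r"\g<1>PRODUCT_NAME\g<2>"),
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "releaseDate"),
    (re.compile(r"\b\w+ is \w+%?"), "TEST_TYPE is TEST_VALUE"),
]


def tag_line(line: str) -> str:
    """Apply every rule in order and return the tagged line."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line


s = ("Releasing version 0.0.1 for Stackoverflow, on 01/01/2019. "
     "The coverage is 99% and build is passed.")
tagged = tag_line(s)
# tagged == "Releasing version VERSION_NUMBER for PRODUCT_NAME, on releaseDate. "
#           "The TEST_TYPE is TEST_VALUE and TEST_TYPE is TEST_VALUE."
```

Keeping the rules in a plain list makes it easy to add or reorder patterns when a new log format shows up.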




          Check out these resources that will help you parse a log in Python using Regex.



          https://pythonicways.wordpress.com/2016/12/20/log-file-parsing-in-python/



          https://medium.com/devops-challenge/apache-log-parser-using-python-8080fbc41dda



          If you prefer Named Entity Recognition or Terminology Extraction techniques, you could adapt an existing NER model and train it yourself on your data. Keep in mind, though, that according to Poibeau, Thierry; Kosseim, Leila (2001), "Proper Name Extraction from Non-Journalistic Texts", Language and Computers 37 (1): 144–157:




          Research indicates that even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains.




          And according to Wikipedia: Considerable effort is involved in tuning NER systems to perform well in a new domain; this is true for both rule-based and trainable statistical systems.



          So even if you pull that off, accuracy will be lower than if you just extract the data using regex. The reduction is approximately from 97% to 93%, and that is only for named entities (companies, names, etc.). Accuracy will be much lower in your case.
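If you want to check this on your own logs rather than take it on faith, you can score both approaches against a small hand-labeled sample. The sketch below computes exact-match precision, recall, and F1 over labeled spans; the `span_scores` helper and the sample spans are hypothetical and only show the shape of such a comparison.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def span_scores(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over labeled spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hand-labeled spans for one line, and the output of some extractor (made up).
gold = {(18, 23, "VERSION_NUMBER"), (28, 41, "PRODUCT_NAME"), (46, 56, "releaseDate")}
predicted = {(18, 23, "VERSION_NUMBER"), (28, 42, "PRODUCT_NAME"), (46, 56, "releaseDate")}

precision, recall, f1 = span_scores(gold, predicted)
```

Here the extractor included the trailing comma in the product name, so that span counts as a miss under exact matching.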



          Check this link for more information about Information Extraction: https://web.stanford.edu/~jurafsky/slp3/17.pdf







          answered Dec 21 '18 at 20:01 by wacax
          edited Dec 22 '18 at 17:26












          • Actually, the outputs come from different OSs and devices and vary a lot, and new OSs/devices are added on a monthly basis. Would you still recommend a regex-based approach? We currently use regex for these, which lets us automate data generation, but would that be helpful?
            – Arpit Kathuria, Dec 22 '18 at 1:53










          • Based on what you say, the model's predictions would be limited to what the regexes can already do. To gather data from new operating systems you would need to update the regexes, which in turn already solves what you are trying to predict. Even if the volume of logs is large enough to justify an ML model, performance will suffer compared with regex extraction. Check the edit on the answer for more information.
            – wacax, Dec 22 '18 at 17:28

















