Tagging Unix/Non-Unix logs using NLP
I have a set of unstructured data consisting of command output logs from different operating systems such as Unix, Windows, etc.
For example:
Releasing version 0.0.1 for Stackoverflow, on 01/01/2019. The coverage is 99% and build is passed.
(This is just an example and is not related to the actual use case.) The output differs between operating systems. I want to perform tagging on this data.
For the case above, the output should be:
Releasing version VERSION_NUMBER for PRODUCT_NAME, on releaseDate. The TEST_TYPE is TEST_VALUE and TEST_TYPE is TEST_VALUE
Here, some words are replaced with their corresponding sample tags.
I have studied techniques like POS tagging, NER, and LSTM, but I don't know which one is suitable for this particular problem. How can I gather data from the raw output, and how do I apply those techniques here?
Thanks to everyone who is willing to help me with this.
machine-learning deep-learning nlp regex
edited Dec 21 '18 at 19:35
wacax
asked Dec 21 '18 at 18:39
Arpit Kathuria
1 Answer
$begingroup$
PoS tagging works for natural language only and identifies the grammatical parts of a sentence, nothing more. LSTM is a recurrent neural network architecture that can be used to model and predict sequences. Named Entity Recognition (NER) and Terminology Extraction could work if you already have data to engage in Information Extraction (IE). However, in order to use these techniques you need a trained model, and in order to train one you need data. In your case that would involve identifying and tagging parts of the sentence by hand and later training a model with that data.
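To make the hand-tagging step concrete, here is a minimal sketch of what labeled training data for a sequence tagger could look like. It assumes character-offset annotations converted to BIO token tags with whitespace tokenization; the helper names `find_span` and `to_bio` are hypothetical, and the labels are borrowed from the question.

```python
import re

def find_span(text, value, label):
    """Locate the first occurrence of `value` and return (start, end, label)."""
    start = text.index(value)
    return (start, start + len(value), label)

def to_bio(text, spans):
    """Convert character-offset annotations to (token, BIO tag) pairs.

    Tokens are whitespace-delimited, so trailing punctuation stays
    attached to the token (an assumption made to keep the sketch short).
    """
    tagged = []
    for m in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # The token overlaps the annotated span
            if m.start() < end and m.end() > start:
                prefix = "B" if m.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((m.group(), tag))
    return tagged

text = "Releasing version 0.0.1 for Stackoverflow, on 01/01/2019."
# Hand-made annotations: this is the manual work a trained model requires
spans = [
    find_span(text, "0.0.1", "VERSION_NUMBER"),
    find_span(text, "Stackoverflow", "PRODUCT_NAME"),
    find_span(text, "01/01/2019", "releaseDate"),
]
bio = to_bio(text, spans)
```

Each log line would need annotations like `spans` before any NER-style model could be trained on it, which is the labeling cost described above.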
The best approach, in my opinion, is to just use regex to identify parts of the sentence, as one of the approaches to Information Extraction, and use hard-coded rules to identify what you are trying to replace later.
For instance, if you want to search for the version in the example:
import re
s = 'Releasing version 0.0.1 for Stackoverflow, on 01/01/2019. The coverage is 99% and build is passed.'
# Lazily capture whatever sits between "Releasing version" and "for"
re.search(r'Releasing version(.*?)for', s).group(1)
' 0.0.1 '
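Going one step further, the same idea can produce the tagged line from the question. This is only a sketch under stated assumptions: the rule list `RULES` and the helper `tag_line` are hypothetical names, and each pattern is tailored to the example line rather than to real logs.

```python
import re

# Ordered (pattern, replacement) rules. Every pattern here is an
# assumption about the example line, not a general log grammar.
RULES = [
    # Dotted version number right after "Releasing version "
    (re.compile(r"(?<=Releasing version )\d+(?:\.\d+)+"), "VERSION_NUMBER"),
    # Single word between "for " and ", on"
    (re.compile(r"(?<=for )\w+(?=, on)"), "PRODUCT_NAME"),
    # Date in NN/NN/NNNN form
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "releaseDate"),
    # "<name> is <percentage|passed|failed>" pairs
    (re.compile(r"\b\w+ is (?:\d+%|passed|failed)"), "TEST_TYPE is TEST_VALUE"),
]

def tag_line(line):
    """Apply every rule in order, replacing matches with their tag."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line

s = 'Releasing version 0.0.1 for Stackoverflow, on 01/01/2019. The coverage is 99% and build is passed.'
tagged = tag_line(s)
print(tagged)
# Releasing version VERSION_NUMBER for PRODUCT_NAME, on releaseDate. The TEST_TYPE is TEST_VALUE and TEST_TYPE is TEST_VALUE.
```

Keeping the rules in a list makes it straightforward to maintain a separate rule set per operating system or device.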
Check out these resources that will help you parse a log in Python using Regex.
https://pythonicways.wordpress.com/2016/12/20/log-file-parsing-in-python/
https://medium.com/devops-challenge/apache-log-parser-using-python-8080fbc41dda
But if you prefer to use Named Entity Recognition or Terminology Extraction techniques, you could adapt an NER model and train it yourself with your data. Keep in mind, though, that according to Poibeau, Thierry; Kosseim, Leila (2001). "Proper Name Extraction from Non-Journalistic Texts". Language and Computers. 37 (1): 144–157:
Research indicates that even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains.
And according to Wikipedia: Considerable effort is involved in tuning NER systems to perform well in a new domain; this is true for both rule-based and trainable statistical systems.
So even if you pull that off, accuracy will be lower than if you just extracted the data using regex. As a rough figure, the reduction is from about 97% to 93%, and that is only for named entities (companies, names, etc.). The accuracy reduction will be much less in your case.
Check this link for more information about Information Extraction: https://web.stanford.edu/~jurafsky/slp3/17.pdf
$endgroup$
Actually, the outputs come from different OSs and devices, which vary a lot, and new OSs and devices are added on a monthly basis. Would you still recommend a regex-based approach? We are currently using regex for these, which lets us automate data generation, but would that be helpful?
– Arpit Kathuria
Dec 22 '18 at 1:53
Based on what you say, the predictions would be based on what regex can already do. To gather new data from new operating systems you would need to update the regex, which in turn would already solve what you are trying to predict next. Even if the volume of logs is massive enough to justify an ML model, performance will suffer compared to extraction with regex. Check the edit on the answer for more information.
– wacax
Dec 22 '18 at 17:28
edited Dec 22 '18 at 17:26
answered Dec 21 '18 at 20:01
wacax