Tagging Unix/Non-Unix logs using NLP

I have a set of unstructured data consisting of command output logs from different operating systems such as Unix, Windows, etc.

For example:

Releasing version 0.0.1 for Stackoverflow, on 01/01/2019. The coverage is 99% and build is passed.

(This is just an example and is not related to the actual use case.) The output differs from one operating system to another. I want to perform tagging on this data.

For the case above, the output should be:

Releasing version VERSION_NUMBER for PRODUCT_NAME, on releaseDate. The TEST_TYPE is TEST_VALUE and TEST_TYPE is TEST_VALUE

Here, some words are replaced with their corresponding sample tags.

I have studied techniques like POS tagging, NER, and LSTM, but I don't know which one is suitable for this particular problem. How can I gather data from the raw output, and how do I apply those techniques here?

Thanks to everyone who is willing to help me with this.

asked Dec 21 '18 at 18:39 by Arpit Kathuria

edited Dec 21 '18 at 19:35 by wacax


machine-learning deep-learning nlp regex
1 Answer

PoS tagging works only on natural language and identifies the grammatical parts of a sentence, nothing more. LSTM is a recurrent neural network architecture that can be used to model and predict sequences. Named Entity Recognition (NER) and Terminology Extraction could work if you already have data with which to do Information Extraction (IE). However, to use these techniques you need a trained model, and to train one you need labeled data. In your case that would mean identifying and tagging parts of each sentence by hand and then training a model on that data.
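To give an idea of what "tagging by hand" would produce, here is a minimal sketch that turns character-level annotations into token-level BIO labels, a common input format for training sequence taggers. The helper names (`span`, `bio_labels`) and the whitespace tokenization are assumptions made for illustration; the tag names are taken from the question.

```python
import re

def span(text, value, label):
    """Locate `value` in `text` and return a (start, end, label) annotation."""
    start = text.index(value)
    return (start, start + len(value), label)

def bio_labels(text, spans):
    """Convert character-level annotations into (token, BIO tag) pairs."""
    labeled = []
    for m in re.finditer(r"\S+", text):  # naive whitespace tokenization
        tag = "O"
        for s, e, label in spans:
            if m.start() < e and m.end() > s:  # token overlaps the annotated span
                # B- if the token contains the span start, I- otherwise
                tag = ("B-" if m.start() <= s else "I-") + label
                break
        labeled.append((m.group(), tag))
    return labeled

line = "Releasing version 0.0.1 for Stackoverflow, on 01/01/2019."
# Hypothetical hand annotations for the sample line
annotations = [
    span(line, "0.0.1", "VERSION_NUMBER"),
    span(line, "Stackoverflow", "PRODUCT_NAME"),
    span(line, "01/01/2019", "releaseDate"),
]
labeled = bio_labels(line, annotations)
for token, tag in labeled:
    print(token, tag)
```

Every log line in the training set would need annotations like these, which is where most of the effort goes.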

The best approach, in my opinion, is simply to use regex, which is itself one of the approaches to Information Extraction, to identify parts of the sentence, with hard-coded rules that pick out what you want to replace.

For instance, if you want to search the version in the example:

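import re

# Sample log line from the question
s = 'Releasing version 0.0.1 for Stackoverflow, on 01/01/2019. The coverage is 99% and build is passed.'

# Non-greedy capture of whatever sits between "Releasing version" and "for"
re.search(r'Releasing version(.*?)for', s).group(1)

' 0.0.1 '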
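Extending the same idea from searching to the tagging asked for in the question, a rough sketch could apply an ordered list of substitution rules. The patterns below are assumptions written only for the sample line (real logs would need their own rules), and `RULES` and `tag_line` are hypothetical names.

```python
import re

# Ordered (pattern, replacement) rules; illustrative, tuned to the sample line only
RULES = [
    (re.compile(r"(?<=version )\d+(?:\.\d+)+"), "VERSION_NUMBER"),
    (re.compile(r"(?<=for )\w+(?=,)"), "PRODUCT_NAME"),
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "releaseDate"),
    (re.compile(r"\b(coverage|build) is (\d+%|\w+)"), "TEST_TYPE is TEST_VALUE"),
]

def tag_line(line):
    """Apply each rule in turn, replacing matched values with their tags."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line

s = 'Releasing version 0.0.1 for Stackoverflow, on 01/01/2019. The coverage is 99% and build is passed.'
tagged = tag_line(s)
print(tagged)
# Releasing version VERSION_NUMBER for PRODUCT_NAME, on releaseDate. The TEST_TYPE is TEST_VALUE and TEST_TYPE is TEST_VALUE.
```

Rule order matters once patterns start to overlap, so more specific rules should usually come first.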

Check out these resources that will help you parse a log in Python using Regex.

https://pythonicways.wordpress.com/2016/12/20/log-file-parsing-in-python/

https://medium.com/devops-challenge/apache-log-parser-using-python-8080fbc41dda

But if you prefer to use Named Entity Recognition or Terminology Extraction, you could adapt an existing NER model and train it yourself on your data. Keep in mind, though, that according to Poibeau, Thierry; Kosseim, Leila (2001). "Proper Name Extraction from Non-Journalistic Texts". Language and Computers. 37 (1): 144–157:

Research indicates that even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains.

And according to Wikipedia: Considerable effort is involved in tuning NER systems to perform well in a new domain; this is true for both rule-based and trainable statistical systems.

So even if you pull that off, accuracy will be lower than if you just extracted the data using regex. The drop is approximately from 97% to 93% for ordinary named entities alone (companies, names, etc.). Accuracy will be much lower in your case.

Check this link for more information about Information Extraction: https://web.stanford.edu/~jurafsky/slp3/17.pdf

answered Dec 21 '18 at 20:01 by wacax

edited Dec 22 '18 at 17:26

Actually, the outputs come from different OSs and devices and vary a lot, and new OSs/devices are added on a monthly basis. Would you still recommend a regex-based approach? We currently use regex for these, which lets us automate data generation, but would that be helpful?
– Arpit Kathuria
Dec 22 '18 at 1:53

Based on what you say, the model's predictions would be limited to what regex can already do: to gather data from a new operating system you would need to update the regex, which in turn would already solve what you are trying to predict. Even if the volume of logs is large enough to justify an ML model, performance will suffer compared to extraction with regex. Check the edit on the answer for more information.
– wacax
Dec 22 '18 at 17:28
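As a rough illustration of how the regex approach from this exchange might be organized when new OSs and devices keep arriving, the sketch below keeps one rule set per platform and reports whether any rule fired, so lines from unknown platforms can be queued for a new rule. The platform keys, patterns, and the `tag_line` helper are all hypothetical.

```python
import re

# Hypothetical per-platform rule sets; names and patterns are illustrative only
RULES_BY_PLATFORM = {
    "unix": [(re.compile(r"\b\d+(?:\.\d+){2}\b"), "VERSION_NUMBER")],
    "windows": [(re.compile(r"\bv\d+(?:\.\d+)+\b"), "VERSION_NUMBER")],
}

def tag_line(platform, line):
    """Return (tagged_line, matched); matched is False when no rule applied."""
    matched = False
    for pattern, tag in RULES_BY_PLATFORM.get(platform, []):
        line, count = pattern.subn(tag, line)
        matched = matched or count > 0
    return line, matched

tagged, ok = tag_line("unix", "Releasing version 0.0.1 for Stackoverflow")
unknown, ok_unknown = tag_line("new-device", "fw rev 7 installed")
print(tagged, ok)
print(unknown, ok_unknown)
```

Supporting a new platform then means adding one entry to the dictionary rather than touching the existing rules.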
