How can I detect anomalies/outliers in my online streaming data on a real-time basis?
$begingroup$
Say I have a huge (effectively infinite) stream of data consisting of sine waves and step pulses alternating one after the other. I want my model to parse the data sequence-wise or point-wise: the first time it parses a sine wave and then starts seeing step pulses, it should raise an alert for an outlier; but as it keeps parsing, it should recognise the alternating sine and step pulses and treat them as the normal pattern. If it then encounters something outside this trend, it should treat that as an outlier too, yet if the new pattern repeats consistently it should come to treat it as normal as well. In other words, my model must "remember" what it saw in the past, to some extent, in order to predict what is "normal" in the near future, and on that basis detect anomalies in my constantly streaming data.
I have tried implementing a conventional stateless LSTM for this, but since an LSTM is trained in a supervised fashion, it needs initial training data and always predicts on the basis of that data. So if the pattern it learned during training deviates in the test phase, it keeps treating the test-phase pattern as an outlier, no matter how many times that pattern repeats. Simply put, it fails to update itself over time.
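A minimal sketch of the adaptive behaviour described above, using only rolling statistics rather than an LSTM (the class name, window size, and threshold here are illustrative, not any library's API): each point is scored against a window of recent history, and since every point joins that history, a pattern that keeps repeating is gradually absorbed into "normal".

```python
# A rolling z-score detector: pure standard library, no training phase.
from collections import deque
import math

class RollingAnomalyDetector:
    """Flags a point as anomalous if it lies far from the rolling mean.

    window   -- how much recent history counts as "normal"
    z_thresh -- how many standard deviations away triggers an alert
    """
    def __init__(self, window=100, z_thresh=4.0):
        self.buf = deque(maxlen=window)
        self.z_thresh = z_thresh

    def update(self, x):
        is_anomaly = False
        if len(self.buf) >= 10:  # need a minimal history before scoring
            mean = sum(self.buf) / len(self.buf)
            var = sum((v - mean) ** 2 for v in self.buf) / len(self.buf)
            std = math.sqrt(var) or 1e-9
            is_anomaly = abs(x - mean) / std > self.z_thresh
        # The point joins the history either way, so a repeated
        # "anomaly" eventually becomes the new normal.
        self.buf.append(x)
        return is_anomaly

det = RollingAnomalyDetector(window=50, z_thresh=4.0)
stream = [math.sin(0.1 * i) for i in range(200)] + [10.0] * 50
flags = [det.update(x) for x in stream]
# The first few 10.0 values are flagged; later ones are absorbed as normal.
```

This only captures level shifts, not sequence structure, but it shows the "forgetting" behaviour the question asks for.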
I have gone through relevant papers on anomaly detection for online streaming data and found that HTM, implemented by Numenta and tested on the NAB benchmark, is the best solution in this respect, but I am looking for something open source and absolutely free to use.
Being a newbie in this field, any pointer to an existing open-source implementation would be highly appreciated; writing something from scratch is not preferred, but if required that will be my last option.
deep-learning classification unsupervised-learning anomaly-detection stacked-lstm
$endgroup$
$begingroup$
In the LSTM, isn't the anomaly the target? You need a dataset with anomalies marked, so the classifier can learn what to look for.
$endgroup$
– Harsh
Nov 11 '18 at 3:05
$begingroup$
Can you give some details on your input data, what it represents, or what problem domain it comes from? There may be well-established approaches that can be used for your problem.
$endgroup$
– jonnor
Dec 5 '18 at 2:49
$begingroup$
My input data is live-streamed network packet statistics, like throughput per second and connections established per second, captured from a client-server interaction. The only solution I have found so far is HTM by Numenta; however, NuPIC by Numenta isn't open source for production use, hence I am looking for alternative solutions.
$endgroup$
– Goutam Bose
Dec 6 '18 at 3:05
asked Nov 10 '18 at 23:01 by Goutam Bose
1 Answer
$begingroup$
There are two well-known algorithms for outlier detection, Isolation Forest and One-Class SVM; you will find implementations of both in scikit-learn.
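A minimal sketch of the scikit-learn route, assuming scikit-learn and NumPy are installed; the data and parameter values here are illustrative, not tuned for any real workload:

```python
# Train an Isolation Forest on "typical" data, then score new points.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # typical traffic
outliers = rng.uniform(low=6, high=8, size=(10, 2))      # far-away points

clf = IsolationForest(contamination=0.02, random_state=42)
clf.fit(normal)

# predict() returns +1 for inliers and -1 for outliers.
print(clf.predict(outliers))    # mostly -1
print(clf.predict(normal[:5]))  # mostly +1
```

One-Class SVM (`sklearn.svm.OneClassSVM`) follows the same fit/predict pattern; both are batch learners, which matters for your streaming constraint, as discussed in the comments below.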
Searching GitHub for "Anomaly Detection" turns up publicly available entries to the NAB competition, e.g. nareshkumar66675/Numenta. That one has a Jupyter notebook which mainly uses scikit-learn plus some simple custom feature engineering, and it may serve your purpose. Although the author has not included licensing information, it looks simple enough to re-implement.
However, as I understand it, the NAB datasets are geared towards "time series" detection, i.e. a point is an anomaly if it is very different from previous/recent values. They have no notion of patterns in the data, such as sine waves following step pulses, and do not involve learning larger patterns as the dataset grows.
I'm not aware of algorithms solving your specific problem, though they might well exist in the literature. The key issue is that you cannot tell whether a long sequence is an anomaly until you have seen enough data, and the search over candidate patterns may suffer from combinatorial explosion.
The sines and pulses of your problem can be encoded as 0s and 1s, so your problem becomes one of detecting patterns in strings. Genomics is concerned with patterns in DNA, so that body of work may have what you need. (Note this is very different from genetic algorithms.)
There is an older family of algorithms, known variously as Market Basket Analysis, the Apriori algorithm, or Association Rule Mining, which has the flavour of increasing pattern size, though not of anomaly detection. See this video explaining it. Apriori finds sets of items commonly bought together: with small amounts of data you can only reliably support small patterns, and as the amount of data increases you can support larger ones.
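A toy illustration of the Apriori idea in pure Python (not a library implementation; the basket data is made up). It grows candidate itemsets level by level, so larger patterns only emerge once the data supports them:

```python
# Level-wise frequent-itemset mining in the spirit of Apriori.
def frequent_itemsets(baskets, min_support):
    """Return all itemsets appearing in at least min_support baskets."""
    items = {frozenset([i]) for b in baskets for i in b}
    level = {s for s in items
             if sum(s <= set(b) for b in baskets) >= min_support}
    result = set(level)
    while level:
        # Join step: merge frequent k-sets into (k+1)-element candidates.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Prune step: keep only candidates with enough support.
        level = {c for c in candidates
                 if sum(c <= set(b) for b in baskets) >= min_support}
        result |= level
    return result

baskets = [{"bread", "milk"}, {"bread", "milk", "eggs"},
           {"bread", "milk"}, {"eggs"}]
found = frequent_itemsets(baskets, min_support=3)
# {bread}, {milk}, and {bread, milk} are frequent; {eggs} is not.
```

A production-grade version would also exploit the Apriori pruning rule more aggressively, but the shape of the computation is the same.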
$endgroup$
$begingroup$
Thanks a lot, sir. But can Isolation Forest be used to detect outliers in streaming data, where the full dataset is not available beforehand and points arrive one by one, as I mentioned in my question?
$endgroup$
– Goutam Bose
Nov 11 '18 at 2:34
$begingroup$
How important is that restriction? If you have a trained classifier, you can use it to detect new anomalies immediately, but you can train a new classifier once every hour and use all the newly collected data. If you do this, then you'll be open to using all the batch learning algorithms / software available. To answer your question, there are papers that discuss Online Isolation Forests, but I'm not aware of any implementations.
$endgroup$
– Harsh
Nov 11 '18 at 3:02
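The retraining idea from the comment above can be sketched as follows, assuming scikit-learn is available; the class name, window size, and refit interval are illustrative. Each point is scored with the current model, and the model is refit on a sliding window of recent data every so often, so "normal" tracks the stream:

```python
# Periodic refit of a batch learner as a poor man's online detector.
import numpy as np
from sklearn.ensemble import IsolationForest

class PeriodicRefitDetector:
    def __init__(self, window=200, refit_every=50):
        self.window = window
        self.refit_every = refit_every
        self.history = []
        self.n_seen = 0
        self.model = None

    def update(self, x):
        x = np.asarray(x, dtype=float).reshape(1, -1)
        flag = False
        if self.model is not None:
            flag = self.model.predict(x)[0] == -1  # -1 means outlier
        self.history.append(x.ravel())
        self.history = self.history[-self.window:]  # sliding window
        self.n_seen += 1
        if self.n_seen % self.refit_every == 0:
            self.model = IsolationForest(random_state=0).fit(
                np.array(self.history))
        return flag

rng = np.random.RandomState(0)
det = PeriodicRefitDetector(window=200, refit_every=50)
for _ in range(100):                       # ordinary traffic
    det.update(rng.normal(size=2))
spike_flagged = det.update([10.0, 10.0])   # far outside recent behaviour
```

The trade-off is latency: a genuinely new pattern keeps being flagged until the next refit absorbs it, which may or may not be acceptable for your load-testing use case.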
$begingroup$
Thanks for replying. The restriction is quite stringent: the model will be used by network load-testing software to detect anomalies in packets streamed between client and server, so the data to be tested is not available beforehand and any prior training is not a preferred option. The model should preferably be unsupervised.
$endgroup$
– Goutam Bose
Nov 11 '18 at 6:10
answered Nov 11 '18 at 0:46 by Harsh