Process mining with ML

I have a somewhat general question. My dataset consists of N sequences of events. One sequence could be, for example, [A,B,C,D,X,Y] and another [A,B,Z], where the letters represent different events. The sequences are at most 80 steps long.



The idea is to predict the next event from the known previous events; in a very simple case, A might always be followed by B. The next step would be to measure the duration of each event, and the ultimate goal is to predict how long it will take until the process reaches a specific event.
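To make the task concrete, a minimal first-order Markov (bigram) baseline over such traces could look like the sketch below. The traces are the toy examples from above; everything here is illustrative, not my real data:

    from collections import defaultdict, Counter

    # Toy traces matching the examples above; letters are event names.
    traces = [["A", "B", "C", "D", "X", "Y"],
              ["A", "B", "Z"]]

    # Count how often each event is followed by each other event.
    transitions = defaultdict(Counter)
    for trace in traces:
        for current, nxt in zip(trace, trace[1:]):
            transitions[current][nxt] += 1

    def predict_next(event):
        """Most frequent successor of `event`, or None if the event was never seen."""
        successors = transitions.get(event)
        return successors.most_common(1)[0][0] if successors else None

    print(predict_next("A"))  # -> 'B' in this toy log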



I have tried an N-gram model, an MLP neural network, and finally an LSTM network, which reached around 80% accuracy.
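For reference, the kind of LSTM next-event model I mean is sketched below in Keras. It is not my exact network: the vocabulary size, layer sizes, and the two toy prefixes are placeholders; events are assumed to be integer-encoded with 0 reserved for padding.

    import numpy as np
    import tensorflow as tf

    num_events = 30   # assumed size of the event vocabulary
    max_len = 80      # traces are at most 80 steps long

    # X holds (padded) prefixes of traces, y the event that followed each prefix.
    X = tf.keras.preprocessing.sequence.pad_sequences(
        [[1, 2], [1, 2, 3]], maxlen=max_len)   # toy prefixes [A,B] and [A,B,C]
    y = np.array([3, 4])                       # toy "next event" labels

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(num_events + 1, 32, mask_zero=True),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(num_events + 1, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X, y, epochs=5, verbose=0)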



That would not be bad if the events were balanced in the dataset. To account for the imbalance, I trained the LSTM with a weighted loss function; the overall accuracy then drops to around 66%, but the less frequent classes are predicted much more accurately (still not perfectly, but better). How can I build a model that gets the best of both, i.e. learns the less frequent AND the most frequent classes at the same time?
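As a concrete example of the weighting, here is a sketch (assuming scikit-learn and Keras; the labels are toy, heavily imbalanced data) of how "balanced" class weights can be computed and then damped as a compromise between the two regimes:

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    # y_train: integer-encoded next-event labels (toy, heavily imbalanced).
    y_train = np.array([1, 1, 1, 1, 1, 1, 2, 3])

    classes = np.unique(y_train)
    weights = compute_class_weight(class_weight="balanced",
                                   classes=classes, y=y_train)

    # "balanced" weights can over-correct; damping them (e.g. a square root)
    # is one way to keep frequent classes from being penalised too hard.
    damped = weights ** 0.5
    class_weight = dict(zip(classes, damped))

    # Then train with e.g. model.fit(X, y_train, class_weight=class_weight, ...)
    print(class_weight)

When comparing the two models I also look at per-class recall or macro-averaged F1 (e.g. sklearn.metrics.classification_report) rather than overall accuracy, since accuracy alone hides exactly this trade-off.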



I have also read that tree-based methods perform very well on unbalanced datasets, but all the examples I have found consider one long time series, whereas my data consist of many short time series. Is it possible to train a random forest on such data, and if so, how?
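One framing I could imagine (a sketch with toy, integer-encoded traces, not my real data) is to slice every trace into fixed-length windows of the last k events and predict the event that follows:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Toy integer-encoded traces; 0 is reserved for "no event" (padding).
    traces = [[1, 2, 3, 4, 5, 6], [1, 2, 7]]
    k = 3   # window size: predict the next event from the last k events

    X, y = [], []
    for trace in traces:
        for i in range(1, len(trace)):
            window = trace[max(0, i - k):i]
            window = [0] * (k - len(window)) + window   # left-pad short prefixes
            X.append(window)
            y.append(trace[i])

    # One-hot encoding the window positions may work better than raw integer
    # codes, since the event ids are categorical, but this keeps the sketch short.
    clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                 random_state=0)
    clf.fit(np.array(X), np.array(y))
    print(clf.predict([[0, 1, 2]]))   # next event after the prefix [A, B]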



If you know of another algorithm or method that could be applied to such data, please post it :)



Thank you.
machine-learning lstm sequential-pattern-mining






asked Aug 15 '18 at 20:46









Matúš Košík

1 Answer







I suspect that the problem has more to do with your data than with your algorithms. My recommendation is to spend some time studying your data and making sure it is a robust representation of the kinds of problems you expect to solve. If possible, come up with a way to generate extra data. Since you already have many permutations, you could perhaps write a script that creates additional permutations by modifying existing samples according to rules you know hold.
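As a sketch of what such a script could look like (the swap rule here is purely hypothetical; substitute whatever invariants actually hold in your process):

    import random

    # Purely hypothetical rule: assume events "X" and "Y" may occur in either
    # order without changing the process, so swapping adjacent X/Y pairs yields
    # new valid traces. Replace this with rules that actually hold in your domain.
    SWAPPABLE = {("X", "Y"), ("Y", "X")}

    def augment(trace):
        new = list(trace)
        for i in range(len(new) - 1):
            if (new[i], new[i + 1]) in SWAPPABLE and random.random() < 0.5:
                new[i], new[i + 1] = new[i + 1], new[i]
        return new

    traces = [["A", "B", "C", "D", "X", "Y"], ["A", "B", "Z"]]
    augmented = traces + [augment(t) for t in traces]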
answered Aug 15 '18 at 21:22
David Shapiro
