Is there a library that automatically parses timestamps out of text?

To start, I have used Splunk for log data analysis. One of Splunk's great features is that it can automatically detect the date of a log event, no matter what format you give it (it is 99% accurate in my experience).

Is there an existing library that can do this, i.e. parse the timestamp into a date regardless of its format and of where it sits in the string? (Preferably in Java.)

Here are some examples (they come from real logs):

(Actual Date in format yyyy-MM-dd HH:mm:ss.SSS ZZZ): Log Entry
2015-11-19 13:19:24.000 -0500: 172.24.133.22 23958 online.acme.com - - [19/Nov/2015:13:19:24 -0500] "POST /app/services/jsevents/loginfo?request HTTP/1.1" 204 - - 1898041 "https://online.acme.com/app/web/index/home" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
2015-11-19 13:21:11.000 (no Time Zone): [Thu Nov 19 13:21:11 2015] [error] [client 172.24.133.27] File does not exist: /server/appportal/docroot/app/ui/public/res/css/img, referer: https://online.acme.com/app/web/documents/viewDocuments?req_type=INSDOC&di=151118AH01CT31447828851329810A53&oc_id=ep0109&lob=AUTOHOME&
2015-11-19 13:23:59.912 -0500: [11/19/15 13:23:59:912 EST] 0000009a PmiRmArmWrapp I   PMRM0003I:  parent:ver=1,ip=172.24.237.31,time=1447880900823,pid=28380,reqid=1245500,event=1 - current:ver=1,ip=172.24.237.31,time=1447880900823,pid=28380,reqid=1245505,event=1 type=Web Services Requestor detail=wsrequestor:DocServicesBeanPort.findDoc?transport=https&parameters=xref elapsed=167
2015-11-19 13:29:36.603 (no Time Zone): 2015-11-19 13:29:36,603 [WebContainer : 26] WARN  172.24.133.26 - - - - Default c.a.p.w.c.user.FindUserProfileBean - Invalid User Type Argument. Deferring to default.
2015-11-19 07:00:40.000 (no Time Zone): 19-Nov-2015.07:00:40: [INFO ] com.acme.app.legacy.LegacyConnector  - Succesful CHANGE Event.

Answers (1)

The best answer I have found so far is a Java library called Natty (with a little help from regex and Joda-Time). Natty is a natural language parser that can parse all kinds of dates in many different formats, but it is not always great with times. For me, times are the easy part, because in every log event I listed above the time has a fairly standard format: hh:mm:ss, with an occasional SSS (milliseconds) separated by a period (.), a comma (,), or a colon (:). I am actually very impressed by this library's ability to parse a natural-language date in so many different forms.

If I can get Natty to tell me where the date is in a string (and refine that result), then the time is usually fairly close by, so I can use a regular expression to get the time.
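A minimal sketch of that second step on its own, using only java.util.regex. It assumes the date parser has already reported where the date text starts and how long it is; the class name TimeWindowSketch and the method findTimeNear are made up for illustration and are not part of Natty. Unlike a backward scan, it walks the matches forward and keeps the last one, so a colon-separated millisecond part such as 13:23:59:912 is read as a whole.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TimeWindowSketch {

    // hh:mm:ss with an optional 3-digit millisecond part after '.', ',' or ':'
    private static final Pattern TIME =
            Pattern.compile("(\\d{2}):(\\d{2}):(\\d{2})(?:[.,:](\\d{3}))?");

    /**
     * Hypothetical helper: returns {hour, minute, second, millis} for the last
     * time found within `tolerance` characters of the date span, or null if
     * no time is found in that window.
     */
    static int[] findTimeNear(String line, int dateStart, int dateLength, int tolerance) {
        int from = Math.max(dateStart - tolerance, 0);
        int to = Math.min(dateStart + dateLength + tolerance, line.length());
        Matcher m = TIME.matcher(line.substring(from, to));
        int[] last = null;
        // Walk all matches left to right and remember the last one.
        while (m.find()) {
            int millis = m.group(4) != null ? Integer.parseInt(m.group(4)) : 0;
            last = new int[] {
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                millis
            };
        }
        return last;
    }

    public static void main(String[] args) {
        String line = "19-Nov-2015.07:00:40: [INFO ] com.acme.app.legacy.LegacyConnector - Succesful CHANGE Event.";
        // Assumption: the date parser reported "19-Nov-2015" starting at index 0.
        int[] t = findTimeNear(line, 0, "19-Nov-2015".length(), 15);
        System.out.printf("%02d:%02d:%02d.%03d%n", t[0], t[1], t[2], t[3]);
    }
}
```

The tolerance of 15 characters mirrors the value I use below; a wider window finds more times but also raises the chance of picking up an unrelated one.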

For anyone interested, I have posted an example of using this library, with some regex and Joda-Time augmentation:

package org.joestelmach.natty;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.joda.time.DateTime;
import org.joda.time.MutableDateTime;

import com.joestelmach.natty.DateGroup;
import com.joestelmach.natty.Parser;

public class ParserTest {

    private static Parser parser = new Parser();

    public static void main(String[] args) {
        String [] lines = {
                "172.24.133.22 23958 online.acme.com - - [19/Nov/2015:13:19:24 -0500] \"POST /app/services/jsevents/loginfo?request HTTP/1.1\" 204 - - 1898041 \"https://online.acme.com/app/web/index/home\" \"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)\"",
                "[Thu Nov 19 13:21:11 2015] [error] [client 172.24.133.27] File does not exist: /server/appportal/docroot/app/ui/public/res/css/img, referer: https://online.acme.com/app/web/documents/viewDocuments?req_type=INSDOC&di=151118AH01CT31447828851329810A53&oc_id=ep0109&lob=AUTOHOME&",
                "[11/19/15 13:23:59:912 EST] 0000009a PmiRmArmWrapp I PMRM0003I: parent:ver=1,ip=172.24.237.31,time=1447880900823,pid=28380,reqid=1245500,event=1 - current:ver=1,ip=172.24.237.31,time=1447880900823,pid=28380,reqid=1245505,event=1 type=Web Services Requestor detail=wsrequestor:DocServicesBeanPort.findDoc?transport=https&parameters=xref elapsed=167",
                "2015-11-19 13:29:36,603 [WebContainer : 26] WARN 172.24.133.26 - - - - Default c.a.p.w.c.user.FindUserProfileBean - Invalid User Type Argument. Deferring to default.",
                "19-Nov-2015.07:00:40: [INFO ] com.acme.app.legacy.LegacyConnector - Succesful CHANGE Event.",
                "DEBUG|2015-11-19-01:14:17.628|WebContainer : 0|          TRACE:BEGIN (876.411s) ContractDAO.findByContractNumberAndSuffix",
        };

        SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS ZZZ");

        Pattern timePattern = Pattern.compile("(\\d{2}):(\\d{2}):(\\d{2})(?:[\\.,:](\\d{3}))?");

        final int DATE_TEXT_SPAN_TOLERANCE = 15;

        for (String line : lines) {
            List<DateGroup> dateGroups = parser.parse(line);
            // Refine the list, since it's possible the parser will return multiple matches
            // (though usually only one of them is an actual match)
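            DateGroup group = refineDateGroupList(dateGroups);
            // refineDateGroupList returns null when Natty finds no usable date; skip such lines
            if (group == null || group.getDates().isEmpty()) {
                continue;
            }
            List<Date> dateList = group.getDates();
            Date firstDate = dateList.get(0);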
            int column = group.getPosition();
            int length = group.getText().length();
            String dateText = group.getText();
            int hour = 0, minute = 0, second = 0, milli = 0;
            boolean matched = false;
            Matcher timeMatcher;
            // First let's check if the time is in the matched date
            // We always check the last match in the regex because of the assumption that the time
            // is usually after the date in the string
            if ((timeMatcher = findLastMatch(dateText, timePattern)) != null) {
                matched = true;
                hour = Integer.parseInt(timeMatcher.group(1));
                minute = Integer.parseInt(timeMatcher.group(2));
                second = Integer.parseInt(timeMatcher.group(3));
                if (timeMatcher.group(4) != null) {
                    milli = Integer.parseInt(timeMatcher.group(4));
                }
            }
            // If not, look up to DATE_TEXT_SPAN_TOLERANCE characters before and after the date text for a time
            if (!matched) {
                int timeSearchStart = Math.max(column - DATE_TEXT_SPAN_TOLERANCE, 0);
                int timeSearchEnd = Math.min(column + length + DATE_TEXT_SPAN_TOLERANCE, line.length());
                String timeSearchSubstring = line.substring(timeSearchStart, timeSearchEnd);
                if ((timeMatcher = findLastMatch(timeSearchSubstring, timePattern)) != null) {
                    hour = Integer.parseInt(timeMatcher.group(1));
                    minute = Integer.parseInt(timeMatcher.group(2));
                    second = Integer.parseInt(timeMatcher.group(3));
                    if (timeMatcher.group(4) != null) {
                        milli = Integer.parseInt(timeMatcher.group(4));
                    }
                }
            }
            MutableDateTime jodaTime = new MutableDateTime(firstDate.getTime());
            jodaTime.setHourOfDay(hour);
            jodaTime.setMinuteOfHour(minute);
            jodaTime.setSecondOfMinute(second);
            jodaTime.setMillisOfSecond(milli);
            firstDate = jodaTime.toDate();
            System.out.printf("DATE: %s [%d] (from matched text \"%s\")\n%s\n====\n", dateFormat.format(firstDate), firstDate.getTime(), dateText, line);
        }

    }

    // Refines the date groups returned from the Natty parser by making sure the date
    // retrieved from the entire line is the same as the date retrieved from the matched
    // text
    private static DateGroup refineDateGroupList(List<DateGroup> dateGroups) {
        if (dateGroups.size() == 1) {
            return dateGroups.get(0);
        }
        if (dateGroups.size() == 0) {
            return null;
        }
        for (DateGroup group : dateGroups) {
            List<DateGroup> subDateGroups = parser.parse(group.getText());
            DateGroup subDateGroup = refineDateGroupList(subDateGroups);
            if (subDateGroup == null) {
                return null;
            }
            List<Date> dateList = group.getDates();
            if (dateList.size() == 0) {
                // This shouldn't actually happen
                return null;
            }
            // Choose the first date
            Date expectedDate = dateList.get(0);
            List<Date> subDateList = subDateGroup.getDates();
            if (subDateList.size() == 0) {
                // Again, this shouldn't happen
                return null;
            }
            // Again, choose the first date
            Date actualDate = subDateList.get(0);
            if (isSameDate(expectedDate, actualDate)) {
                return group;
            }
        }
        // If none of them match, the first one wins
        return dateGroups.get(0);
    }

    // Makes sure that the yyyy, MM, and dd are the same between two dates
    private static boolean isSameDate(Date expectedDate, Date actualDate) {
        DateTime expectedDateTime = new DateTime(expectedDate.getTime());
        DateTime actualDateTime = new DateTime(actualDate.getTime());
        return expectedDateTime.year().equals(actualDateTime.year()) &&
                expectedDateTime.monthOfYear().equals(actualDateTime.monthOfYear()) &&
                expectedDateTime.dayOfMonth().equals(actualDateTime.dayOfMonth());
    }

    private static Matcher findLastMatch(String text, Pattern pattern) {
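        int length = text.length();
        if (length == 0) return null;
        // Try start offsets from the end down to index 0 (inclusive), so a time
        // that begins at the very start of the text is not missed
        for (int start = length - 1; start >= 0; start--) {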
            String subText = text.substring(start, length);
            Matcher matcher = pattern.matcher(subText);
            if (matcher.find()) {
                return matcher;
            }
        }
        return null;
    }

}

And sure enough, the output of this is:

DATE: 2015-11-19 13:19:24.000 -0500 [1447957164000] (from matched text "19/Nov/2015")
172.24.133.22 23958 online.acme.com - - [19/Nov/2015:13:19:24 -0500] "POST /app/services/jsevents/loginfo?request HTTP/1.1" 204 - - 1898041 "https://online.acme.com/app/web/index/home" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
====
DATE: 2015-11-19 13:21:11.000 -0500 [1447957271000] (from matched text "Thu Nov 19 13:21:11")
[Thu Nov 19 13:21:11 2015] [error] [client 172.24.133.27] File does not exist: /server/appportal/docroot/app/ui/public/res/css/img, referer: https://online.acme.com/app/web/documents/viewDocuments?req_type=INSDOC&di=151118AH01CT31447828851329810A53&oc_id=ep0109&lob=AUTOHOME&
====
DATE: 2015-11-19 13:23:59.000 -0500 [1447957439000] (from matched text "11/19/15 13:23:59")
[11/19/15 13:23:59:912 EST] 0000009a PmiRmArmWrapp I PMRM0003I: parent:ver=1,ip=172.24.237.31,time=1447880900823,pid=28380,reqid=1245500,event=1 - current:ver=1,ip=172.24.237.31,time=1447880900823,pid=28380,reqid=1245505,event=1 type=Web Services Requestor detail=wsrequestor:DocServicesBeanPort.findDoc?transport=https&parameters=xref elapsed=167
====
DATE: 2015-11-19 13:29:36.000 -0500 [1447957776000] (from matched text "2015-11-19 13:29:36")
2015-11-19 13:29:36,603 [WebContainer : 26] WARN 172.24.133.26 - - - - Default c.a.p.w.c.user.FindUserProfileBean - Invalid User Type Argument. Deferring to default.
====
DATE: 2015-11-19 07:00:40.000 -0500 [1447934440000] (from matched text "19-Nov-2015")
19-Nov-2015.07:00:40: [INFO ] com.acme.app.legacy.LegacyConnector - Succesful CHANGE Event.
====
DATE: 2015-11-19 01:14:17.628 -0500 [1447913657628] (from matched text "2015-11-19")
DEBUG|2015-11-19-01:14:17.628|WebContainer : 0|          TRACE:BEGIN (876.411s) ContractDAO.findByContractNumberAndSuffix
====

It is not bulletproof yet, but it sets the stage for exactly what I need to do.

FYI, the Joda-Time project is now in maintenance mode, and the team recommends migrating to the java.time classes.
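As a rough illustration of what that migration might look like for the step that merges a parsed date with a separately extracted time, here is a sketch using only java.time. The class name JavaTimeSketch and its two helper methods are hypothetical, and the fixed -05:00 offset is an assumption taken from the sample output above rather than something detected from the log line.

```java
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Date;

final class JavaTimeSketch {

    /** Hypothetical helper: the calendar date of a java.util.Date as seen in the given zone. */
    static LocalDate toLocalDate(Date date, ZoneId zone) {
        return date.toInstant().atZone(zone).toLocalDate();
    }

    /** Hypothetical helper: combines a calendar date with a separately extracted time of day. */
    static ZonedDateTime combine(LocalDate date, int hour, int minute, int second,
                                 int millis, ZoneId zone) {
        // LocalTime.of expects nanoseconds, so scale the millisecond value.
        LocalTime time = LocalTime.of(hour, minute, second, millis * 1_000_000);
        return ZonedDateTime.of(date, time, zone);
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneOffset.ofHours(-5); // assumption: logs were written at UTC-05:00
        ZonedDateTime ts = combine(LocalDate.of(2015, 11, 19), 1, 14, 17, 628, zone);
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS xx");
        System.out.println(fmt.format(ts));
        System.out.println(ts.toInstant().toEpochMilli());
    }
}
```

Because the date and the time of day are kept as immutable LocalDate and LocalTime values until they are combined, there is no mutable intermediate object to reset field by field.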