
List:       kde-commits
Subject:    [kfilemetadata/externalextractors] /: Add external extractor plugin support
From:       Varun Joshi <varunj.1011 () gmail ! com>
Date:       2016-02-27 5:55:15
Message-ID: E1aZXqV-00018s-2M () scm ! kde ! org

Git commit 78ad83082578d07eac9fc196880f8caba927f79b by Varun Joshi.
Committed on 27/02/2016 at 05:37.
Pushed by vjoshi into branch 'externalextractors'.

Add external extractor plugin support

1. Add the ExternalExtractor class, which wraps the external extractor process in the standard Extractor interface
2. Modify ExtractorCollection to support ExternalExtractors
3. Add an example PyPDF2 extractor plugin
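
For readers who want to exercise a plugin outside KFileMetaData, the exchange that ExternalExtractor performs (write a JSON request to the plugin's stdin, wait up to 30 seconds, parse the JSON reply) can be approximated in Python. This is only a sketch under stated assumptions: `run_external_extractor` and its `command` parameter are hypothetical names, and unlike the C++ code in this commit, which checks `QProcess::exitStatus()`, it treats any non-zero exit code as a failure.

```python
import json
import subprocess
import sys

EXTRACTOR_TIMEOUT_S = 30  # mirrors EXTRACTOR_TIMEOUT_MS (30000) in the commit


def run_external_extractor(command, path, mimetype, timeout=EXTRACTOR_TIMEOUT_S):
    """Send the JSON request on stdin and return the plugin's "properties" dict.

    `command` is an argv list (hypothetical parameter). Returns {} when the
    plugin cannot be started, times out, exits with a non-zero code, or prints
    something that is not a JSON object.
    """
    request = json.dumps({"path": path, "mimetype": mimetype})
    try:
        proc = subprocess.run(command, input=request, capture_output=True,
                              text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return {}
    if proc.returncode != 0:
        return {}
    try:
        reply = json.loads(proc.stdout)
    except ValueError:
        return {}
    if not isinstance(reply, dict):
        return {}
    props = reply.get("properties", {})
    return props if isinstance(props, dict) else {}


if __name__ == "__main__":
    # Hypothetical stand-in plugin: echoes the requested path back as the title.
    echo = [sys.executable, "-c",
            "import sys, json; req = json.load(sys.stdin); "
            "print(json.dumps({'properties': {'title': req['path']}, "
            "'status': 'OK', 'error': ''}))"]
    print(run_external_extractor(echo, "/tmp/test.pdf", "application/pdf"))
```

Returning an empty dict on every failure path matches the commit's behaviour of leaving the ExtractionResult untouched when the plugin misbehaves.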

M  +19   -0    README.md
M  +6    -0    autotests/CMakeLists.txt
A  +59   -0    autotests/externalextractortest.cpp     [License: LGPL (v2.1+)]
A  +37   -0    autotests/externalextractortest.h     [License: LGPL (v2.1+)]
M  +1    -0    src/CMakeLists.txt
A  +5    -0    src/config-kfilemetadata.h.in
A  +163  -0    src/externalextractor.cpp     [License: LGPL]
A  +47   -0    src/externalextractor.h     [License: LGPL]
M  +27   -1    src/extractorcollection.cpp
M  +1    -3    src/extractorcollection.h
M  +0    -2    src/extractorplugin.h
M  +2    -0    src/extractors/CMakeLists.txt
A  +6    -0    src/extractors/externalextractors/CMakeLists.txt
A  +44   -0    src/extractors/externalextractors/pdfextractor/main.py
A  +5    -0    src/extractors/externalextractors/pdfextractor/manifest.json

http://commits.kde.org/kfilemetadata/78ad83082578d07eac9fc196880f8caba927f79b
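
The README text added in this commit describes the plugin side of the protocol: JSON in, JSON out, with `properties`, `status` and `error` keys. A minimal plugin skeleton, without the PyPDF2 dependency of the bundled example, might look like the sketch below. `handle_request`, `extract` and the `"ERROR"` status string are assumptions; the commit only shows that the host logs `error` whenever `status` is not `"OK"`.

```python
import json


def handle_request(raw_request, extract):
    """Build the JSON reply for one request read from stdin.

    `extract` is a hypothetical callable (path, mimetype) -> dict of properties.
    """
    reply = {"properties": {}, "status": "OK", "error": ""}
    try:
        request = json.loads(raw_request)
        reply["properties"] = extract(request["path"], request["mimetype"])
    except Exception as exc:
        # Report the failure in the reply instead of crashing the process.
        reply["status"] = "ERROR"  # assumed value; anything other than "OK"
        reply["error"] = str(exc)
    return json.dumps(reply)
```

A plugin's entry point would then read all of stdin, pass it to `handle_request` together with its own extraction function, and print the returned string.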

diff --git a/README.md b/README.md
index 19b1a26..291be0a 100644
--- a/README.md
+++ b/README.md
@@ -48,6 +48,25 @@ The ExtractionResult should also be given a list of types. These types are
 defined in the `types.h` header. The correspond to a higher level overview
 of the files which the user typically expects.
 
+## Writing an external plugin
+
+Extractors and Writers can also be written in other languages and installed into the system,
+and KFileMetaData will be able to find them and use them.
+
+An external plugin must be an independently executable file (a binary,
+script with a hashbang line with the executable permission set, a batch file or
+cmd script, etc). They must be located within libexec directory.
+
+KFileMetaData will wrap each external extractor with an instance of the `ExternalExtractor` class, and every writer with `ExternalWriter`. The application will be free to choose any of the plugins returned by `WriterCollection` or `ExtractorCollection`.
+
+Every external plugin will be placed within a directory in libexec/kf5/kfilemetadata/externalextractors. Every plugin shall have a manifest.json file that specifies the mimetypes that the plugin supports and the main executable file. A sample manifest file is located at src/writers/externalwriters/example/manifest.json.
+
+Both kinds of plugins accept the target file as an argument.
+
+### Writing an external extractor
+
+Extractors take JSON formatted input specifying the input mimetype, and return JSON output with the extracted properties. The JSON output also indicates any errors that might have occurred. Calls to the extractor are blocking, hence there is a time limit for how long they can run.
+
 ## Links
 - Mailing list: <https://mail.kde.org/mailman/listinfo/kde-devel>
 - IRC channel: #kde-devel on Freenode
diff --git a/autotests/CMakeLists.txt b/autotests/CMakeLists.txt
index 9d30836..0f59660 100644
--- a/autotests/CMakeLists.txt
+++ b/autotests/CMakeLists.txt
@@ -97,3 +97,9 @@ if(TAGLIB_FOUND)
         LINK_LIBRARIES Qt5::Test KF5::FileMetaData ${TAGLIB_LIBRARIES}
     )
 endif()
+
+
+ecm_add_test(externalextractortest.cpp ../src/externalextractor.cpp
+        TEST_NAME "externalextractortest"
+        LINK_LIBRARIES Qt5::Test KF5::FileMetaData KF5::I18n
+    )
diff --git a/autotests/externalextractortest.cpp b/autotests/externalextractortest.cpp
new file mode 100644
index 0000000..bd0f502
--- /dev/null
+++ b/autotests/externalextractortest.cpp
@@ -0,0 +1,59 @@
+/*
+ * <one line to give the library's name and an idea of what it does.>
+ * Copyright (C) 2014  Vishesh Handa <me@vhanda.in>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
+ *
+ */
+
+#include "externalextractortest.h"
+#include "simpleextractionresult.h"
+#include "indexerextractortestsconfig.h"
+#include "externalextractor.h"
+#include "config-kfilemetadata.h"
+
+#include <QDebug>
+#include <QTest>
+#include <QDir>
+
+using namespace KFileMetaData;
+
+QString ExternalExtractorTest::testFilePath(const QString& fileName) const
+{
+    return QLatin1String(INDEXER_TESTS_SAMPLE_FILES_PATH) + QDir::separator() + fileName;
+}
+
+void ExternalExtractorTest::test()
+{
+    QScopedPointer<ExtractorPlugin> plugin(
+                new ExternalExtractor(
+                    QStringLiteral(LIBEXEC_INSTALL_DIR) +
+                    QStringLiteral("/kfilemetadata/externalextractors/pdfextractor")
+    ));
+
+    SimpleExtractionResult result(testFilePath("test.pdf"), "application/pdf");
+    plugin->extract(&result);
+
+    QCOMPARE(result.properties().value(Property::Author), QVariant(QStringLiteral("Happy Man")));
+    QCOMPARE(result.properties().value(Property::Title), QVariant(QStringLiteral("The Big Brown Bear")));
+    QCOMPARE(result.properties().value(Property::Subject), QVariant(QStringLiteral("PDF Metadata")));
+
+    QString dt("D:20140701153850+02\'00\'");
+    QCOMPARE(result.properties().value(Property::CreationDate), QVariant(dt));
+
+    QCOMPARE(result.properties().size(), 4);
+}
+
+QTEST_MAIN(ExternalExtractorTest)
diff --git a/autotests/externalextractortest.h b/autotests/externalextractortest.h
new file mode 100644
index 0000000..2ec3ea5
--- /dev/null
+++ b/autotests/externalextractortest.h
@@ -0,0 +1,37 @@
+/*
+ * <one line to give the library's name and an idea of what it does.>
+ * Copyright (C) 2014  Vishesh Handa <me@vhanda.in>
+ * Copyright (C) 2016  Varun Joshi <varunj.1011@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
+ *
+ */
+
+#ifndef EXTERNALEXTRACTORTEST_H
+#define EXTERNALEXTRACTORTEST_H
+
+#include <QObject>
+
+class ExternalExtractorTest : public QObject
+{
+    Q_OBJECT
+private:
+    QString testFilePath(const QString& fileName) const;
+
+private Q_SLOTS:
+    void test();
+};
+
+#endif // EXTERNALEXTRACTORTEST_H
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index a549085..ebc2fa5 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -4,6 +4,7 @@ add_library(KF5FileMetaData
     extractor.cpp
     extractorplugin.cpp
     extractorcollection.cpp
+    externalextractor.cpp
     propertyinfo.cpp
     typeinfo.cpp
     usermetadata.cpp
diff --git a/src/config-kfilemetadata.h.in b/src/config-kfilemetadata.h.in
new file mode 100644
index 0000000..90b7433
--- /dev/null
+++ b/src/config-kfilemetadata.h.in
@@ -0,0 +1,5 @@
+#ifndef CONFIGKFILEMETADATA_H
+#define CONFIGKFILEMETADATA_H
+#define LIBEXEC_INSTALL_DIR "${CMAKE_INSTALL_PREFIX}/${KF5_LIBEXEC_INSTALL_DIR}"
+
+#endif // CONFIGKFILEMETADATA_H
diff --git a/src/externalextractor.cpp b/src/externalextractor.cpp
new file mode 100644
index 0000000..f36329e
--- /dev/null
+++ b/src/externalextractor.cpp
@@ -0,0 +1,163 @@
+/*
+ * This file is part of the KFileMetaData project
+ * Copyright (C) 2016  Varun Joshi <varunj.1011@gmail.com>
+ * Copyright (C) 2015  Boudhayan Gupta <bgupta@kde.org>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) version 3, or any
+ * later version accepted by the membership of KDE e.V. (or its
+ * successor approved by the membership of KDE e.V.), which shall
+ * act as a proxy defined in Section 6 of version 3 of the license.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library.  If not, see <http://www.gnu.org/licenses/>.
+ *
+ */
+
+#include "externalextractor.h"
+
+#include <QDebug>
+#include <QDir>
+#include <QProcess>
+#include <QJsonDocument>
+#include <QJsonObject>
+#include <QJsonArray>
+
+#include <KLocalizedString>
+
+#include "properties.h"
+#include "propertyinfo.h"
+#include "typeinfo.h"
+
+#define EXTRACTOR_TIMEOUT_MS 30000
+
+using namespace KFileMetaData;
+
+struct ExternalExtractor::ExternalExtractorPrivate {
+    QString path;
+    QStringList writeMimetypes;
+    QString mainPath;
+};
+
+
+ExternalExtractor::ExternalExtractor(QObject* parent)
+    : ExtractorPlugin(parent),
+      d_ptr(new ExternalExtractorPrivate)
+{
+}
+
+ExternalExtractor::ExternalExtractor(const QString& pluginPath)
+    : ExtractorPlugin(new QObject()),
+      d_ptr(new ExternalExtractorPrivate)
+{
+    Q_D(ExternalExtractor);
+
+    d->path = pluginPath;
+
+    QDir pluginDir(pluginPath);
+    QStringList pluginDirContents = pluginDir.entryList();
+
+    if (!pluginDirContents.contains(QStringLiteral("manifest.json"))) {
+        qDebug() << i18n("Path does not seem to contain a valid plugin");
+        return;
+    }
+
+    QFile manifest(pluginDir.filePath(QStringLiteral("manifest.json")));
+    manifest.open(QIODevice::ReadOnly);
+    QJsonDocument manifestDoc = QJsonDocument::fromJson(manifest.readAll());
+    if (!manifestDoc.isObject()) {
+        qDebug() << i18n("Manifest does not seem to be a valid JSON Object");
+        return;
+    }
+
+    QJsonObject rootObject = manifestDoc.object();
+    QJsonArray mimetypesArray = rootObject.value(QStringLiteral("mimetypes")).toArray();
+    QStringList mimetypes;
+    Q_FOREACH(QVariant mimetype, mimetypesArray) {
+        mimetypes << mimetype.toString();
+    }
+
+    d->writeMimetypes.append(mimetypes);
+    d->mainPath = pluginDir.filePath(rootObject[QStringLiteral("main")].toString());
+}
+
+ExternalExtractor::~ExternalExtractor()
+{
+    delete d_ptr;
+}
+
+QStringList ExternalExtractor::mimetypes() const
+{
+    Q_D(const ExternalExtractor);
+
+    return d->writeMimetypes;
+}
+
+void ExternalExtractor::extract(ExtractionResult* result)
+{
+    Q_D(ExternalExtractor);
+
+    QJsonDocument writeData;
+    QJsonObject writeRootObject;
+    QByteArray output;
+    QByteArray errorOutput;
+
+    writeRootObject[QStringLiteral("path")] = QJsonValue(result->inputUrl());
+    writeRootObject[QStringLiteral("mimetype")] = result->inputMimetype();
+    writeData.setObject(writeRootObject);
+
+    QProcess writerProcess;
+    writerProcess.start(d->mainPath, QIODevice::ReadWrite);
+    writerProcess.write(writeData.toJson());
+    writerProcess.closeWriteChannel();
+    writerProcess.waitForFinished(EXTRACTOR_TIMEOUT_MS);
+
+    output = writerProcess.readAll();
+    errorOutput = writerProcess.readAllStandardError();
+
+    if (writerProcess.exitStatus()) {
+        qDebug() << errorOutput;
+        return;
+    }
+
+    // now we read in the output (which is a standard json format) into the
+    // ExtractionResult
+
+    QJsonDocument extractorData = QJsonDocument::fromJson(output);
+    if (!extractorData.isObject()) {
+        return;
+    }
+    QJsonObject rootObject = extractorData.object();
+    QJsonObject propertiesObject = rootObject[QStringLiteral("properties")].toObject();
+
+    Q_FOREACH(auto key, propertiesObject.keys()) {
+        if (key == QStringLiteral("typeInfo")) {
+            TypeInfo info = TypeInfo::fromName(propertiesObject.value(key).toString());
+            result->addType(info.type());
+            continue;
+        }
+
+        // for plaintext extraction
+        if (key == QStringLiteral("text")) {
+            result->append(propertiesObject.value(key).toString(QStringLiteral("")));
+            continue;
+        }
+
+        PropertyInfo info = PropertyInfo::fromName(key);
+        if (info.name() != key) {
+            continue;
+        }
+        result->add(info.property(), propertiesObject.value(key).toVariant());
+    }
+
+    if (rootObject[QStringLiteral("status")].toString() != QStringLiteral("OK")) {
+        qDebug() << rootObject[QStringLiteral("error")].toString();
+    }
+}
diff --git a/src/externalextractor.h b/src/externalextractor.h
new file mode 100644
index 0000000..b1f8ca0
--- /dev/null
+++ b/src/externalextractor.h
@@ -0,0 +1,47 @@
+/*
+ * This file is part of the KFileMetaData project
+ * Copyright (C) 2016  Varun Joshi <varunj.1011@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) version 3, or any
+ * later version accepted by the membership of KDE e.V. (or its
+ * successor approved by the membership of KDE e.V.), which shall
+ * act as a proxy defined in Section 6 of version 3 of the license.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library.  If not, see <http://www.gnu.org/licenses/>.
+ *
+ */
+
+#ifndef EXTERNALEXTRACTOR_H
+#define EXTERNALEXTRACTOR_H
+
+#include "extractorplugin.h"
+
+namespace KFileMetaData {
+
+class ExternalExtractor : public ExtractorPlugin
+{
+public:
+    explicit ExternalExtractor(QObject* parent = 0);
+    ExternalExtractor(const QString& pluginPath);
+    virtual ~ExternalExtractor();
+
+    QStringList mimetypes() const Q_DECL_OVERRIDE;
+    void extract(ExtractionResult* result) Q_DECL_OVERRIDE;
+
+private:
+    struct ExternalExtractorPrivate;
+    ExternalExtractorPrivate *d_ptr;
+    Q_DECLARE_PRIVATE(ExternalExtractor)
+};
+}
+
+#endif // EXTERNALEXTRACTOR_H
diff --git a/src/extractorcollection.cpp b/src/extractorcollection.cpp
index a1bde65..1bcdc80 100644
--- a/src/extractorcollection.cpp
+++ b/src/extractorcollection.cpp
@@ -1,6 +1,7 @@
 /*
  * <one line to give the library's name and an idea of what it does.>
  * Copyright (C) 2012  Vishesh Handa <me@vhanda.in>
+ * Copyright (C) 2016  Varun Joshi <varunj.1011@gmail.com>
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public
@@ -22,12 +23,15 @@
 #include "extractorplugin.h"
 #include "extractorcollection.h"
 #include "extractor_p.h"
+#include "externalextractor.h"
 
 #include <QDebug>
 #include <QCoreApplication>
 #include <QPluginLoader>
 #include <QDir>
 
+#include "config-kfilemetadata.h"
+
 using namespace KFileMetaData;
 
 class ExtractorCollection::Private {
@@ -60,6 +64,8 @@ QList<Extractor*> ExtractorCollection::Private::allExtractors() const
 {
     QStringList plugins;
     QStringList pluginPaths;
+    QStringList externalPlugins;
+    QStringList externalPluginPaths;
 
     QStringList paths = QCoreApplication::libraryPaths();
     Q_FOREACH (const QString& libraryPath, paths) {
@@ -72,7 +78,7 @@ QList<Extractor*> ExtractorCollection::Private::allExtractors() const
 
         QStringList entryList = dir.entryList(QDir::Files | QDir::NoDotAndDotDot);
         Q_FOREACH (const QString& fileName, entryList) {
-            // Make sure the same plugin is not loaded twice, even if it
+            // Make sure the same plugin is not loaded twice, even if it is
             // installed in two different locations
             if (plugins.contains(fileName))
                 continue;
@@ -83,6 +89,18 @@ QList<Extractor*> ExtractorCollection::Private::allExtractors() const
 }
     plugins.clear();
 
+    QDir externalPluginDir(QStringLiteral(LIBEXEC_INSTALL_DIR) + QStringLiteral("/kfilemetadata/externalextractors"));
+    // For external plugins, we look into the directories
+    QStringList externalPluginEntryList = externalPluginDir.entryList(QDir::Dirs | QDir::NoDotAndDotDot);
+    Q_FOREACH (const QString& externalPlugin, externalPluginEntryList) {
+        if (externalPlugins.contains(externalPlugin))
+            continue;
+
+        externalPlugins << externalPlugin;
+        externalPluginPaths << externalPluginDir.absoluteFilePath(externalPlugin);
+    }
+    externalPlugins.clear();
+
     QList<Extractor*> extractors;
     Q_FOREACH (const QString& pluginPath, pluginPaths) {
         QPluginLoader loader(pluginPath);
@@ -111,6 +129,14 @@ QList<Extractor*> ExtractorCollection::Private::allExtractors() const
 }
     }
 
+    Q_FOREACH (const QString& externalPluginPath, externalPluginPaths) {
+        ExternalExtractor *plugin = new ExternalExtractor(externalPluginPath);
+        Extractor* extractor = new Extractor;
+        extractor->d->m_plugin = plugin;
+
+        extractors << extractor;
+    }
+
     return extractors;
 }
 
diff --git a/src/extractorcollection.h b/src/extractorcollection.h
index 8542aed..d4e7796 100644
--- a/src/extractorcollection.h
+++ b/src/extractorcollection.h
@@ -21,13 +21,11 @@
 #ifndef _KFILEMETADATA_EXTRACTORCOLLECTION_H
 #define _KFILEMETADATA_EXTRACTORCOLLECTION_H
 
+#include "extractor.h"
 #include "kfilemetadata_export.h"
 
 namespace KFileMetaData
 {
-
-class Extractor;
-
 /**
  * \class ExtractorCollection extractorcollection.h
  *
diff --git a/src/extractorplugin.h b/src/extractorplugin.h
index 65abad3..a7ad3e6 100644
--- a/src/extractorplugin.h
+++ b/src/extractorplugin.h
@@ -27,8 +27,6 @@
 #include "kfilemetadata_export.h"
 #include "extractionresult.h"
 
-#include <QStringList>
-
 namespace KFileMetaData
 {
 
diff --git a/src/extractors/CMakeLists.txt b/src/extractors/CMakeLists.txt
index 5dd223e..c62d170 100644
--- a/src/extractors/CMakeLists.txt
+++ b/src/extractors/CMakeLists.txt
@@ -170,3 +170,5 @@ if (QMOBIPOCKET_FOUND)
     DESTINATION ${PLUGIN_INSTALL_DIR}/kf5/kfilemetadata)
 
 endif()
+
+add_subdirectory(externalextractors)
diff --git a/src/extractors/externalextractors/CMakeLists.txt b/src/extractors/externalextractors/CMakeLists.txt
new file mode 100644
index 0000000..b14c207
--- /dev/null
+++ b/src/extractors/externalextractors/CMakeLists.txt
@@ -0,0 +1,6 @@
+install(
+    DIRECTORY pdfextractor
+    DESTINATION ${KF5_LIBEXEC_INSTALL_DIR}/kfilemetadata/externalextractors
+    PATTERN "*.py"
+    PERMISSIONS WORLD_READ OWNER_WRITE WORLD_EXECUTE
+    )
diff --git a/src/extractors/externalextractors/pdfextractor/main.py b/src/extractors/externalextractors/pdfextractor/main.py
new file mode 100755
index 0000000..0a2b9df
--- /dev/null
+++ b/src/extractors/externalextractors/pdfextractor/main.py
@@ -0,0 +1,44 @@
+#!/usr/bin/python
+
+import sys
+import subprocess
+import os.path
+import os
+import json
+
+from PyPDF2 import PdfFileReader
+from PyPDF2.generic import NameObject
+
+extractor_data = json.loads(sys.stdin.read())
+
+def extract():
+     path = extractor_data.get('path')
+     mimetype = extractor_data.get('mimetype')
+
+     reader = PdfFileReader(path)
+     document_info = reader.getDocumentInfo()
+
+     properties = {}
+     for property in document_info:
+
+         if property == NameObject('/Author'):
+              properties['author'] = document_info[NameObject('/Author')]
+
+         if property == NameObject('/CreationDate'):
+              properties['creationDate'] = document_info[NameObject('/CreationDate')]
+
+         if property == NameObject('/Subject'):
+              properties['subject'] = document_info[NameObject('/Subject')]
+
+         if property == NameObject('/Title'):
+              properties['title'] = document_info[NameObject('/Title')]
+
+     return_value = {}
+     return_value['properties'] = properties
+     return_value['status'] = 'OK'
+     return_value['error'] = ''
+
+     print(json.dumps(return_value))
+
+if __name__ == "__main__":
+    extract()
diff --git a/src/extractors/externalextractors/pdfextractor/manifest.json b/src/extractors/externalextractors/pdfextractor/manifest.json
new file mode 100644
index 0000000..cf3bbbf
--- /dev/null
+++ b/src/extractors/externalextractors/pdfextractor/manifest.json
@@ -0,0 +1,5 @@
+{
+    "main": "main.py",
+    "mimetypes": ["application/pdf"],
+    "language": "python"
+}
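
The manifest above is what the ExternalExtractor constructor reads to find the plugin's executable and supported mimetypes. A rough Python equivalent of that lookup is sketched here for illustration; `load_manifest` is a hypothetical helper, not part of KFileMetaData, and it rejects a missing or malformed manifest by returning `None`, where the C++ code logs a debug message and returns early.

```python
import json
import os


def load_manifest(plugin_dir):
    """Return (main_path, mimetypes) for a plugin directory, or None if invalid."""
    manifest_path = os.path.join(plugin_dir, "manifest.json")
    try:
        with open(manifest_path, encoding="utf-8") as handle:
            manifest = json.load(handle)
    except (OSError, ValueError):
        # Missing file, unreadable file, or invalid JSON.
        return None
    if not isinstance(manifest, dict):
        return None

    raw_mimetypes = manifest.get("mimetypes")
    if not isinstance(raw_mimetypes, list):
        raw_mimetypes = []
    mimetypes = [m for m in raw_mimetypes if isinstance(m, str)]

    # "main" is resolved relative to the plugin directory, as in the commit.
    main_path = os.path.join(plugin_dir, str(manifest.get("main", "")))
    return main_path, mimetypes
```

The `language` key is ignored here because the C++ code in this commit does not read it either.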

