List: kde-commits
Subject: [kfilemetadata/externalextractors] /: Add external extractor plugin support
From: Varun Joshi <varunj.1011 () gmail ! com>
Date: 2016-02-27 5:55:15
Message-ID: E1aZXqV-00018s-2M () scm ! kde ! org
Git commit 78ad83082578d07eac9fc196880f8caba927f79b by Varun Joshi.
Committed on 27/02/2016 at 05:37.
Pushed by vjoshi into branch 'externalextractors'.
Add external extractor plugin support
1. Add the ExternalExtractor class, which wraps the external extractor process in the
   standard Extractor interface.
2. Modify ExtractorCollection so that it supports ExternalExtractors.
3. Add an example PyPDF2 extractor plugin.
M +19 -0 README.md
M +6 -0 autotests/CMakeLists.txt
A +59 -0 autotests/externalextractortest.cpp [License: LGPL (v2.1+)]
A +37 -0 autotests/externalextractortest.h [License: LGPL (v2.1+)]
M +1 -0 src/CMakeLists.txt
A +5 -0 src/config-kfilemetadata.h.in
A +163 -0 src/externalextractor.cpp [License: LGPL]
A +47 -0 src/externalextractor.h [License: LGPL]
M +27 -1 src/extractorcollection.cpp
M +1 -3 src/extractorcollection.h
M +0 -2 src/extractorplugin.h
M +2 -0 src/extractors/CMakeLists.txt
A +6 -0 src/extractors/externalextractors/CMakeLists.txt
A +44 -0 src/extractors/externalextractors/pdfextractor/main.py
A +5 -0 src/extractors/externalextractors/pdfextractor/manifest.json
http://commits.kde.org/kfilemetadata/78ad83082578d07eac9fc196880f8caba927f79b
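
For readers skimming the diff below, the plugin side of the extractor protocol can be sketched in a few lines of Python. This is only an illustration, not code from the commit: the request keys (`path`, `mimetype`) and reply keys (`properties`, `status`, `error`) are taken from the diff, while `handle_request`, `fake_extractor` and the `"ERROR"` status value are assumptions made for the example (the commit only defines `"OK"`).

```python
import json


def handle_request(raw_request, extract_properties):
    """Turn one JSON request string into one JSON reply string."""
    try:
        request = json.loads(raw_request)
        properties = extract_properties(request["path"], request["mimetype"])
        reply = {"properties": properties, "status": "OK", "error": ""}
    except Exception as exc:
        # Report the failure in the reply instead of crashing the plugin.
        # "ERROR" is an assumed status value; the commit only defines "OK".
        reply = {"properties": {}, "status": "ERROR", "error": str(exc)}
    return json.dumps(reply)


def fake_extractor(path, mimetype):
    # Hypothetical stand-in for a real parser such as PyPDF2.
    return {"title": "The Big Brown Bear", "author": "Happy Man"}


if __name__ == "__main__":
    # A real plugin would read the request from standard input instead.
    demo_request = json.dumps({"path": "test.pdf", "mimetype": "application/pdf"})
    print(handle_request(demo_request, fake_extractor))
```

Keeping the JSON handling in one function and the format-specific parsing in another means a malformed request or a parser exception still produces a well-formed reply for the host process to read.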
diff --git a/README.md b/README.md
index 19b1a26..291be0a 100644
--- a/README.md
+++ b/README.md
@@ -48,6 +48,25 @@ The ExtractionResult should also be given a list of types. These
types are defined in the `types.h` header. The correspond to a higher level overview
of the files which the user typically expects.
+## Writing an external plugin
+
+Extractors and Writers can also be written in other languages and installed into the system,
+and KFileMetaData will be able to find them and use them.
+
+An external plugin must be an independently executable file (a binary,
+a script with a hashbang line and the executable permission set, a batch file or
+cmd script, etc.). It must be located within the libexec directory.
+
+KFileMetaData will wrap each external extractor with an instance of the `ExternalExtractor` class, and every writer with `ExternalWriter`. The application will be free to choose any of the plugins returned by `WriterCollection` or `ExtractorCollection`.
+
+Every external plugin will be placed within a directory in libexec/kf5/kfilemetadata/externalextractors. Every plugin shall have a manifest.json file that specifies the mimetypes that the plugin supports and the main executable file. A sample manifest file is located at src/writers/externalwriters/example/manifest.json.
+
+Both kinds of plugins accept the target file as an argument.
+
+### Writing an external extractor
+
+Extractors take JSON formatted input specifying the input mimetype, and return JSON output with the extracted properties. The JSON output also indicates any errors that might have occurred. Calls to the extractor are blocking, hence there is a time limit for how long they can run.
+
## Links
- Mailing list: <https://mail.kde.org/mailman/listinfo/kde-devel>
- IRC channel: #kde-devel on Freenode
diff --git a/autotests/CMakeLists.txt b/autotests/CMakeLists.txt
index 9d30836..0f59660 100644
--- a/autotests/CMakeLists.txt
+++ b/autotests/CMakeLists.txt
@@ -97,3 +97,9 @@ if(TAGLIB_FOUND)
LINK_LIBRARIES Qt5::Test KF5::FileMetaData ${TAGLIB_LIBRARIES}
)
endif()
+
+
+ecm_add_test(externalextractortest.cpp ../src/externalextractor.cpp
+ TEST_NAME "externalextractortest"
+ LINK_LIBRARIES Qt5::Test KF5::FileMetaData KF5::I18n
+ )
diff --git a/autotests/externalextractortest.cpp b/autotests/externalextractortest.cpp
new file mode 100644
index 0000000..bd0f502
--- /dev/null
+++ b/autotests/externalextractortest.cpp
@@ -0,0 +1,59 @@
+/*
+ * <one line to give the library's name and an idea of what it does.>
+ * Copyright (C) 2014 Vishesh Handa <me@vhanda.in>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ *
+ */
+
+#include "externalextractortest.h"
+#include "simpleextractionresult.h"
+#include "indexerextractortestsconfig.h"
+#include "externalextractor.h"
+#include "config-kfilemetadata.h"
+
+#include <QDebug>
+#include <QTest>
+#include <QDir>
+
+using namespace KFileMetaData;
+
+QString ExternalExtractorTest::testFilePath(const QString& fileName) const
+{
+ return QLatin1String(INDEXER_TESTS_SAMPLE_FILES_PATH) + QDir::separator() + fileName;
+}
+
+void ExternalExtractorTest::test()
+{
+ QScopedPointer<ExtractorPlugin> plugin(
+ new ExternalExtractor(
+ QStringLiteral(LIBEXEC_INSTALL_DIR) +
+ QStringLiteral("/kfilemetadata/externalextractors/pdfextractor")
+ ));
+
+ SimpleExtractionResult result(testFilePath("test.pdf"), "application/pdf");
+ plugin->extract(&result);
+
+ QCOMPARE(result.properties().value(Property::Author), QVariant(QStringLiteral("Happy Man")));
+ QCOMPARE(result.properties().value(Property::Title), QVariant(QStringLiteral("The Big Brown Bear")));
+ QCOMPARE(result.properties().value(Property::Subject), QVariant(QStringLiteral("PDF Metadata")));
+
+ QString dt("D:20140701153850+02\'00\'");
+ QCOMPARE(result.properties().value(Property::CreationDate), QVariant(dt));
+
+ QCOMPARE(result.properties().size(), 4);
+}
+
+QTEST_MAIN(ExternalExtractorTest)
diff --git a/autotests/externalextractortest.h b/autotests/externalextractortest.h
new file mode 100644
index 0000000..2ec3ea5
--- /dev/null
+++ b/autotests/externalextractortest.h
@@ -0,0 +1,37 @@
+/*
+ * <one line to give the library's name and an idea of what it does.>
+ * Copyright (C) 2014 Vishesh Handa <me@vhanda.in>
+ * Copyright (C) 2016 Varun Joshi <varunj.1011@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ *
+ */
+
+#ifndef EXTERNALEXTRACTORTEST_H
+#define EXTERNALEXTRACTORTEST_H
+
+#include <QObject>
+
+class ExternalExtractorTest : public QObject
+{
+ Q_OBJECT
+private:
+ QString testFilePath(const QString& fileName) const;
+
+private Q_SLOTS:
+ void test();
+};
+
+#endif // EXTERNALEXTRACTORTEST_H
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index a549085..ebc2fa5 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -4,6 +4,7 @@ add_library(KF5FileMetaData
extractor.cpp
extractorplugin.cpp
extractorcollection.cpp
+ externalextractor.cpp
propertyinfo.cpp
typeinfo.cpp
usermetadata.cpp
diff --git a/src/config-kfilemetadata.h.in b/src/config-kfilemetadata.h.in
new file mode 100644
index 0000000..90b7433
--- /dev/null
+++ b/src/config-kfilemetadata.h.in
@@ -0,0 +1,5 @@
+#ifndef CONFIGKFILEMETADATA_H
+#define CONFIGKFILEMETADATA_H
+#define LIBEXEC_INSTALL_DIR "${CMAKE_INSTALL_PREFIX}/${KF5_LIBEXEC_INSTALL_DIR}"
+
+#endif // CONFIGKFILEMETADATA_H
diff --git a/src/externalextractor.cpp b/src/externalextractor.cpp
new file mode 100644
index 0000000..f36329e
--- /dev/null
+++ b/src/externalextractor.cpp
@@ -0,0 +1,163 @@
+/*
+ * This file is part of the KFileMetaData project
+ * Copyright (C) 2016 Varun Joshi <varunj.1011@gmail.com>
+ * Copyright (C) 2015 Boudhayan Gupta <bgupta@kde.org>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) version 3, or any
+ * later version accepted by the membership of KDE e.V. (or its
+ * successor approved by the membership of KDE e.V.), which shall
+ * act as a proxy defined in Section 6 of version 3 of the license.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library. If not, see <http://www.gnu.org/licenses/>.
+ *
+ */
+
+#include "externalextractor.h"
+
+#include <QDebug>
+#include <QDir>
+#include <QProcess>
+#include <QJsonDocument>
+#include <QJsonObject>
+#include <QJsonArray>
+
+#include <KLocalizedString>
+
+#include "properties.h"
+#include "propertyinfo.h"
+#include "typeinfo.h"
+
+#define EXTRACTOR_TIMEOUT_MS 30000
+
+using namespace KFileMetaData;
+
+struct ExternalExtractor::ExternalExtractorPrivate {
+ QString path;
+ QStringList writeMimetypes;
+ QString mainPath;
+};
+
+
+ExternalExtractor::ExternalExtractor(QObject* parent)
+ : ExtractorPlugin(parent),
+ d_ptr(new ExternalExtractorPrivate)
+{
+}
+
+ExternalExtractor::ExternalExtractor(const QString& pluginPath)
+ : ExtractorPlugin(new QObject()),
+ d_ptr(new ExternalExtractorPrivate)
+{
+ Q_D(ExternalExtractor);
+
+ d->path = pluginPath;
+
+ QDir pluginDir(pluginPath);
+ QStringList pluginDirContents = pluginDir.entryList();
+
+ if (!pluginDirContents.contains(QStringLiteral("manifest.json"))) {
+ qDebug() << i18n("Path does not seem to contain a valid plugin");
+ return;
+ }
+
+ QFile manifest(pluginDir.filePath(QStringLiteral("manifest.json")));
+ manifest.open(QIODevice::ReadOnly);
+ QJsonDocument manifestDoc = QJsonDocument::fromJson(manifest.readAll());
+ if (!manifestDoc.isObject()) {
+ qDebug() << i18n("Manifest does not seem to be a valid JSON Object");
+ return;
+ }
+
+ QJsonObject rootObject = manifestDoc.object();
+ QJsonArray mimetypesArray = rootObject.value(QStringLiteral("mimetypes")).toArray();
+ QStringList mimetypes;
+ Q_FOREACH(QVariant mimetype, mimetypesArray) {
+ mimetypes << mimetype.toString();
+ }
+
+ d->writeMimetypes.append(mimetypes);
+ d->mainPath = pluginDir.filePath(rootObject[QStringLiteral("main")].toString());
+}
+
+ExternalExtractor::~ExternalExtractor()
+{
+ delete d_ptr;
+}
+
+QStringList ExternalExtractor::mimetypes() const
+{
+ Q_D(const ExternalExtractor);
+
+ return d->writeMimetypes;
+}
+
+void ExternalExtractor::extract(ExtractionResult* result)
+{
+ Q_D(ExternalExtractor);
+
+ QJsonDocument writeData;
+ QJsonObject writeRootObject;
+ QByteArray output;
+ QByteArray errorOutput;
+
+ writeRootObject[QStringLiteral("path")] = QJsonValue(result->inputUrl());
+ writeRootObject[QStringLiteral("mimetype")] = result->inputMimetype();
+ writeData.setObject(writeRootObject);
+
+ QProcess writerProcess;
+ writerProcess.start(d->mainPath, QIODevice::ReadWrite);
+ writerProcess.write(writeData.toJson());
+ writerProcess.closeWriteChannel();
+ writerProcess.waitForFinished(EXTRACTOR_TIMEOUT_MS);
+
+ output = writerProcess.readAll();
+ errorOutput = writerProcess.readAllStandardError();
+
+ if (writerProcess.exitStatus()) {
+ qDebug() << errorOutput;
+ return;
+ }
+
+ // now we read in the output (which is a standard json format) into the
+ // ExtractionResult
+
+ QJsonDocument extractorData = QJsonDocument::fromJson(output);
+ if (!extractorData.isObject()) {
+ return;
+ }
+ QJsonObject rootObject = extractorData.object();
+ QJsonObject propertiesObject = rootObject[QStringLiteral("properties")].toObject();
+
+ Q_FOREACH(auto key, propertiesObject.keys()) {
+ if (key == QStringLiteral("typeInfo")) {
+ TypeInfo info = TypeInfo::fromName(propertiesObject.value(key).toString());
+ result->addType(info.type());
+ continue;
+ }
+
+ // for plaintext extraction
+ if (key == QStringLiteral("text")) {
+ result->append(propertiesObject.value(key).toString(QStringLiteral("")));
+ continue;
+ }
+
+ PropertyInfo info = PropertyInfo::fromName(key);
+ if (info.name() != key) {
+ continue;
+ }
+ result->add(info.property(), propertiesObject.value(key).toVariant());
+ }
+
+ if (rootObject[QStringLiteral("status")].toString() != QStringLiteral("OK")) {
+ qDebug() << rootObject[QStringLiteral("error")].toString();
+ }
+}
diff --git a/src/externalextractor.h b/src/externalextractor.h
new file mode 100644
index 0000000..b1f8ca0
--- /dev/null
+++ b/src/externalextractor.h
@@ -0,0 +1,47 @@
+/*
+ * This file is part of the KFileMetaData project
+ * Copyright (C) 2016 Varun Joshi <varunj.1011@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) version 3, or any
+ * later version accepted by the membership of KDE e.V. (or its
+ * successor approved by the membership of KDE e.V.), which shall
+ * act as a proxy defined in Section 6 of version 3 of the license.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library. If not, see <http://www.gnu.org/licenses/>.
+ *
+ */
+
+#ifndef EXTERNALEXTRACTOR_H
+#define EXTERNALEXTRACTOR_H
+
+#include "extractorplugin.h"
+
+namespace KFileMetaData {
+
+class ExternalExtractor : public ExtractorPlugin
+{
+public:
+ explicit ExternalExtractor(QObject* parent = 0);
+ ExternalExtractor(const QString& pluginPath);
+ virtual ~ExternalExtractor();
+
+ QStringList mimetypes() const Q_DECL_OVERRIDE;
+ void extract(ExtractionResult* result) Q_DECL_OVERRIDE;
+
+private:
+ struct ExternalExtractorPrivate;
+ ExternalExtractorPrivate *d_ptr;
+ Q_DECLARE_PRIVATE(ExternalExtractor)
+};
+}
+
+#endif // EXTERNALEXTRACTOR_H
diff --git a/src/extractorcollection.cpp b/src/extractorcollection.cpp
index a1bde65..1bcdc80 100644
--- a/src/extractorcollection.cpp
+++ b/src/extractorcollection.cpp
@@ -1,6 +1,7 @@
/*
* <one line to give the library's name and an idea of what it does.>
* Copyright (C) 2012 Vishesh Handa <me@vhanda.in>
+ * Copyright (C) 2016 Varun Joshi <varunj.1011@gmail.com>
*
* This library is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public
@@ -22,12 +23,15 @@
#include "extractorplugin.h"
#include "extractorcollection.h"
#include "extractor_p.h"
+#include "externalextractor.h"
#include <QDebug>
#include <QCoreApplication>
#include <QPluginLoader>
#include <QDir>
+#include "config-kfilemetadata.h"
+
using namespace KFileMetaData;
class ExtractorCollection::Private {
@@ -60,6 +64,8 @@ QList<Extractor*> ExtractorCollection::Private::allExtractors() const
{
QStringList plugins;
QStringList pluginPaths;
+ QStringList externalPlugins;
+ QStringList externalPluginPaths;
QStringList paths = QCoreApplication::libraryPaths();
Q_FOREACH (const QString& libraryPath, paths) {
@@ -72,7 +78,7 @@ QList<Extractor*> ExtractorCollection::Private::allExtractors() const
QStringList entryList = dir.entryList(QDir::Files | QDir::NoDotAndDotDot);
Q_FOREACH (const QString& fileName, entryList) {
- // Make sure the same plugin is not loaded twice, even if it
+ // Make sure the same plugin is not loaded twice, even if it is
// installed in two different locations
if (plugins.contains(fileName))
continue;
@@ -83,6 +89,18 @@ QList<Extractor*> ExtractorCollection::Private::allExtractors() const
}
plugins.clear();
+ QDir externalPluginDir(QStringLiteral(LIBEXEC_INSTALL_DIR) + QStringLiteral("/kfilemetadata/externalextractors"));
+ // For external plugins, we look into the directories
+ QStringList externalPluginEntryList = externalPluginDir.entryList(QDir::Dirs | QDir::NoDotAndDotDot);
+ Q_FOREACH (const QString& externalPlugin, externalPluginEntryList) {
+ if (externalPlugins.contains(externalPlugin))
+ continue;
+
+ externalPlugins << externalPlugin;
+ externalPluginPaths << externalPluginDir.absoluteFilePath(externalPlugin);
+ }
+ externalPlugins.clear();
+
QList<Extractor*> extractors;
Q_FOREACH (const QString& pluginPath, pluginPaths) {
QPluginLoader loader(pluginPath);
@@ -111,6 +129,14 @@ QList<Extractor*> ExtractorCollection::Private::allExtractors() const
}
}
+ Q_FOREACH (const QString& externalPluginPath, externalPluginPaths) {
+ ExternalExtractor *plugin = new ExternalExtractor(externalPluginPath);
+ Extractor* extractor = new Extractor;
+ extractor->d->m_plugin = plugin;
+
+ extractors << extractor;
+ }
+
return extractors;
}
diff --git a/src/extractorcollection.h b/src/extractorcollection.h
index 8542aed..d4e7796 100644
--- a/src/extractorcollection.h
+++ b/src/extractorcollection.h
@@ -21,13 +21,11 @@
#ifndef _KFILEMETADATA_EXTRACTORCOLLECTION_H
#define _KFILEMETADATA_EXTRACTORCOLLECTION_H
+#include "extractor.h"
#include "kfilemetadata_export.h"
namespace KFileMetaData
{
-
-class Extractor;
-
/**
* \class ExtractorCollection extractorcollection.h
*
diff --git a/src/extractorplugin.h b/src/extractorplugin.h
index 65abad3..a7ad3e6 100644
--- a/src/extractorplugin.h
+++ b/src/extractorplugin.h
@@ -27,8 +27,6 @@
#include "kfilemetadata_export.h"
#include "extractionresult.h"
-#include <QStringList>
-
namespace KFileMetaData
{
diff --git a/src/extractors/CMakeLists.txt b/src/extractors/CMakeLists.txt
index 5dd223e..c62d170 100644
--- a/src/extractors/CMakeLists.txt
+++ b/src/extractors/CMakeLists.txt
@@ -170,3 +170,5 @@ if (QMOBIPOCKET_FOUND)
DESTINATION ${PLUGIN_INSTALL_DIR}/kf5/kfilemetadata)
endif()
+
+add_subdirectory(externalextractors)
diff --git a/src/extractors/externalextractors/CMakeLists.txt b/src/extractors/externalextractors/CMakeLists.txt
new file mode 100644
index 0000000..b14c207
--- /dev/null
+++ b/src/extractors/externalextractors/CMakeLists.txt
@@ -0,0 +1,6 @@
+install(
+ DIRECTORY pdfextractor
+ DESTINATION ${KF5_LIBEXEC_INSTALL_DIR}/kfilemetadata/externalextractors
+ PATTERN "*.py"
+ PERMISSIONS WORLD_READ OWNER_WRITE WORLD_EXECUTE
+ )
diff --git a/src/extractors/externalextractors/pdfextractor/main.py b/src/extractors/externalextractors/pdfextractor/main.py
new file mode 100755
index 0000000..0a2b9df
--- /dev/null
+++ b/src/extractors/externalextractors/pdfextractor/main.py
@@ -0,0 +1,44 @@
+#!/usr/bin/python
+
+import sys
+import subprocess
+import os.path
+import os
+import json
+
+from PyPDF2 import PdfFileReader
+from PyPDF2.generic import NameObject
+
+extractor_data = json.loads(sys.stdin.read())
+
+def extract():
+    path = extractor_data.get('path')
+    mimetype = extractor_data.get('mimetype')
+
+    reader = PdfFileReader(path)
+    document_info = reader.getDocumentInfo()
+
+    properties = {}
+    for property in document_info:
+
+        if property == NameObject('/Author'):
+            properties['author'] = document_info[NameObject('/Author')]
+
+        if property == NameObject('/CreationDate'):
+            properties['creationDate'] = document_info[NameObject('/CreationDate')]
+
+        if property == NameObject('/Subject'):
+            properties['subject'] = document_info[NameObject('/Subject')]
+
+        if property == NameObject('/Title'):
+            properties['title'] = document_info[NameObject('/Title')]
+
+    return_value = {}
+    return_value['properties'] = properties
+    return_value['status'] = 'OK'
+    return_value['error'] = ''
+
+    print(json.dumps(return_value))
+
+if __name__ == "__main__":
+    extract()
diff --git a/src/extractors/externalextractors/pdfextractor/manifest.json b/src/extractors/externalextractors/pdfextractor/manifest.json
new file mode 100644
index 0000000..cf3bbbf
--- /dev/null
+++ b/src/extractors/externalextractors/pdfextractor/manifest.json
@@ -0,0 +1,5 @@
+{
+ "main": "main.py",
+ "mimetypes": ["application/pdf"],
+ "language": "python"
+}
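
The host side of this exchange (what `ExternalExtractor` does in C++ above) can be approximated in Python for anyone who wants to exercise a plugin outside KFileMetaData. This is a rough sketch under stated assumptions, not the library's implementation: `load_manifest` and `run_extractor` are hypothetical helper names, the 30-second limit mirrors `EXTRACTOR_TIMEOUT_MS`, and, unlike the C++ code, the sketch discards properties when the reply status is not `OK` or the process exits with a non-zero code.

```python
import json
import subprocess
from pathlib import Path

EXTRACTOR_TIMEOUT_S = 30  # mirrors EXTRACTOR_TIMEOUT_MS (30000 ms) in the C++ code


def load_manifest(plugin_dir):
    """Return (main executable path, supported mimetypes) for a plugin directory."""
    plugin_dir = Path(plugin_dir)
    manifest = json.loads((plugin_dir / "manifest.json").read_text(encoding="utf-8"))
    if not isinstance(manifest, dict):
        raise ValueError("manifest.json must contain a JSON object")
    return plugin_dir / manifest["main"], list(manifest.get("mimetypes", []))


def run_extractor(command, file_path, mimetype, timeout_s=EXTRACTOR_TIMEOUT_S):
    """Send one request to the plugin and return its properties ({} on failure)."""
    request = json.dumps({"path": str(file_path), "mimetype": mimetype})
    try:
        done = subprocess.run(command, input=request, capture_output=True,
                              text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return {}  # plugin missing, not executable, or too slow
    if done.returncode != 0:
        return {}
    try:
        reply = json.loads(done.stdout)
    except ValueError:
        return {}  # the plugin printed something that is not JSON
    if not isinstance(reply, dict) or reply.get("status") != "OK":
        return {}
    return reply.get("properties", {})
```

A caller would typically pass the path returned by `load_manifest` as a one-element command list, for example `run_extractor([str(main_path)], "test.pdf", "application/pdf")`, after checking that the mimetype is in the manifest's list.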