From koffice-devel Sun Jan 24 23:15:25 2010
From: Jos van den Oever
Date: Sun, 24 Jan 2010 23:15:25 +0000
To: koffice-devel
Subject: how to extend the new powerpoint parser
Message-Id: <201001250015.25759.jos.van.den.oever () kogmbh ! com>
X-MARC-Message: https://marc.info/?l=koffice-devel&m=126437498520507

Hi all,

The conversion to the new parser is mostly done. The main remaining task is fixing any regressions that turn up in the conversion.

The generated code provides a simple value-based API. Here is an excerpt:

  class ExHyperlinkContainer : public StreamOffset {
  public:
      OfficeArtRecordHeader rh;
      ExHyperlinkAtom exHyperlinkAtom;
      QSharedPointer<FriendlyNameAtom> friendlyNameAtom;
      QSharedPointer<TargetAtom> targetAtom;
      QSharedPointer<LocationAtom> locationAtom;
      ExHyperlinkContainer(void* /*dummy*/ = 0) {}
  };

ExHyperlinkContainer has two obligatory members (rh and exHyperlinkAtom) and three optional members (friendlyNameAtom, targetAtom and locationAtom, of types FriendlyNameAtom, TargetAtom and LocationAtom). To know what these members are, look in the documentation [1].

This was generated from this XML:

You can see here again which members are optional. You also see instructions on how to parse the code. The member 'rh' has limitations on the values that its members may have. These limitations are taken from the code.

If you need to get at a structure which has not yet been added to the parser, you can do so yourself by describing the structure in mso.xml.
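Coming back to the ExHyperlinkContainer excerpt for a moment: the practical consequence of the optional members is that calling code has to check them for null before use. Here is a minimal sketch of that pattern. It is not the generated code: std::shared_ptr stands in for QSharedPointer so the example needs no Qt, the three atom structs are simplified stand-ins, and the 'text' field and the hyperlinkLabel() helper are hypothetical names invented for illustration.

```cpp
#include <memory>
#include <string>

// Hypothetical, simplified stand-ins for the generated atom types.
// The real generated classes have different members; 'text' is invented here.
struct FriendlyNameAtom { std::string text; };
struct TargetAtom { std::string text; };
struct LocationAtom { std::string text; };

// Reduced version of the container: only the three optional members.
// std::shared_ptr is used in place of QSharedPointer (assumption for this sketch).
struct ExHyperlinkContainer {
    std::shared_ptr<FriendlyNameAtom> friendlyNameAtom;
    std::shared_ptr<TargetAtom> targetAtom;
    std::shared_ptr<LocationAtom> locationAtom;
};

// Hypothetical helper: pick the first optional member that is present.
// An absent optional member is a null pointer, so each one is tested first.
std::string hyperlinkLabel(const ExHyperlinkContainer& c) {
    if (c.friendlyNameAtom) return c.friendlyNameAtom->text;
    if (c.targetAtom) return c.targetAtom->text;
    if (c.locationAtom) return c.locationAtom->text;
    return std::string();  // none of the optional members was in the file
}
```

The obligatory members are plain values and can be read directly; only the pointer members need this kind of guard.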
Check out msoscheme and build the generator:

  # check out the code
  git clone git://gitorious.org/msoscheme/msoscheme.git
  # build and test (you need a Java compiler and Apache Ant)
  cd msoscheme && ant
  # adapt the code in src/mso.xml and regenerate the parsers
  ant generateParsers
  # look at the new parsers in cpp/simpleParser.h and cpp/simpleParser.cpp

You can check the new mso.xml with the C++ code provided in the project:

  mkdir build && cd build
  cmake ../cpp
  # convert your file to xml
  ppttoxml $yourfile
  # print the structure of the file
  pptstructureprinter $yourfile

If you find a file that is not parsed properly, you get output from the koconverter like this:

  95515 bytes left at the end of PowerPointStructs, so probably an error at position 7082

If you see this, you should find out what structure is defined at position 7082, which is the likely cause of the problem. Use pptstructureprinter for this. It gives output like this:

  ...
  1 15 0 7d0 1012 7054 DocInfoListContainer
  2 15 1 3ff 20 7062 VBAInfoContainer
  3 2 0 400 12 7070 VBAInfoAtom
  2 15 0 3fa 103 7090 SlideViewInfoInstance
  3 0 0 3fe 3 7098 SlideViewInfoAtom
  ...

The columns here are
  1) nesting level
  2) recVer
  3) recInstance
  4) recType
  5) size
  6) position

You can see that position 7082, reported as the likely cause of the error, lies inside a VBAInfoAtom: that record starts at 7070 and the next record starts at 7090.

The next step to fix this problem is to look up this structure in mso.xml and compare it with the documentation. In this case the definition in mso.xml matches that in the documentation. So either the description is incomplete or the ppt file was not created in PowerPoint but e.g. in OpenOffice. To see what is going on, put a breakpoint in the function parseVBAInfoAtom and see where the parsing error occurs.

The most common error is a discrepancy between the documentation and mso.xml. If the observed file does not match the documentation, then add a remark in mso.xml at the place where you add the exception.
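Looking up the reported position by eye gets tedious for large files, so here is a rough sketch of how the lookup could be automated. It is not part of msoscheme: the names RecordLine, parseRecordLine and deepestRecordAt are made up for this example. It assumes the six whitespace-separated columns listed above followed by the structure name, decimal size and position, and an 8-byte record header that is not counted in the size column (this is what the sample output suggests, e.g. 7070 + 8 + 12 = 7090).

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// One row of pptstructureprinter output (hypothetical helper type).
struct RecordLine {
    int level = 0;              // 1) nesting level
    std::string recVer;         // 2) kept as text, not interpreted here
    std::string recInstance;    // 3) kept as text, not interpreted here
    std::string recType;        // 4) printed in hex, kept as text
    std::uint64_t size = 0;     // 5) payload size in bytes (decimal)
    std::uint64_t position = 0; // 6) offset of the record header (decimal)
    std::string name;           // structure name at the end of the line
};

// Assumption: every record starts with an 8-byte header that the size
// column does not include.
constexpr std::uint64_t kHeaderSize = 8;

// Returns false for lines that do not have the expected columns, e.g. "...".
bool parseRecordLine(const std::string& line, RecordLine& out) {
    std::istringstream in(line);
    return static_cast<bool>(in >> out.level >> out.recVer >> out.recInstance
                                >> out.recType >> out.size >> out.position
                                >> out.name);
}

// Returns the name of the most deeply nested record whose byte range
// [position, position + header + size) contains errorPosition,
// or an empty string if no listed record covers it.
std::string deepestRecordAt(const std::string& printerOutput,
                            std::uint64_t errorPosition) {
    std::istringstream lines(printerOutput);
    std::string line;
    std::string bestName;
    int bestLevel = -1;
    while (std::getline(lines, line)) {
        RecordLine rec;
        if (!parseRecordLine(line, rec)) continue;
        const std::uint64_t end = rec.position + kHeaderSize + rec.size;
        if (errorPosition >= rec.position && errorPosition < end
            && rec.level > bestLevel) {
            bestLevel = rec.level;
            bestName = rec.name;
        }
    }
    return bestName;
}
```

Fed the sample output above and position 7082, this picks VBAInfoAtom over the enclosing VBAInfoContainer and DocInfoListContainer because it has the highest nesting level.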

Good luck,
Jos

[1] [MS-PPT].pdf and [MS-ODRAW].pdf

-- 
Jos van den Oever, software architect
+49 391 25 19 15 53
http://kogmbh.com/legal/