'[R] Help request: Parsing docx files for key words and appending to a spreadsheet'


List:       r-help
Subject:    [R] Help request: Parsing docx files for key words and appending to a spreadsheet
From:       Andy <phaedrusv () gmail ! com>
Date:       2023-12-29 18:14:02
Message-ID: e2368e9b-b11b-89a3-34cd-a7cf51fa8288 () gmail ! com

Hello

I am trying to work through a problem, but feel like I've gone down a 
rabbit hole. I'd very much appreciate any help.

The task: I have several directories of *.docx files (newspaper articles 
downloaded from Lexis+), some containing 2,500+ files. I want to iterate 
through them and append to a spreadsheet only those articles that satisfy 
a condition (i.e., a specific keyword covers >= 50% of the subject 
matter). Lexis+ documents have a very specific structure, and keywords 
are given in the row "Subject".

I'd like to be able to accomplish the following:

(1) Append the title, the month, the author, the number of words, and 
page number(s) to a spreadsheet

(2) Read each article and extract the keywords. In the docs, these are 
listed in the 'Subject' section, each with a percentage showing the 
extent to which the keyword features in the article (e.g., FAST FASHION 
(72%)). I want to append the keyword and its % coverage to the same row 
in the spreadsheet, but only if the keyword coverage meets the threshold 
of >= 50%; if not, pass on to the next article in the directory. Rinse 
and repeat for the entire directory.
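
To make the threshold in (2) concrete, here is a minimal base R sketch 
of the filtering step only. It assumes the Subject text has already been 
extracted from the document as a single string, and that entries look 
like "KEYWORD (NN%)" separated by semicolons. The function name 
parse_subject and the sample string are hypothetical, not taken from 
Lexis+ or from any package.

```r
# Hypothetical Subject text; the real Lexis+ layout may differ.
subject <- "FAST FASHION (72%); RETAILERS (45%); TEXTILES (50%)"

# Split a Subject string into keyword/coverage pairs and keep only
# those at or above the threshold (assumed format: "KEYWORD (NN%)").
parse_subject <- function(subject, threshold = 50) {
  pattern <- "[^;()]+\\([0-9]+%\\)"
  hits <- regmatches(subject, gregexpr(pattern, subject))[[1]]

  # Keyword: everything before the trailing "(NN%)", trimmed.
  keyword <- trimws(sub("\\([0-9]+%\\)$", "", hits))
  # Coverage: the digits inside the trailing "(NN%)".
  coverage <- as.numeric(sub("^.*\\(([0-9]+)%\\)$", "\\1", hits))

  out <- data.frame(keyword = keyword, coverage = coverage,
                    stringsAsFactors = FALSE)
  out[out$coverage >= threshold, , drop = FALSE]
}

parse_subject(subject)
```

With the sample string, this keeps FAST FASHION (72) and TEXTILES (50) 
and drops RETAILERS (45). An article whose Subject yields no rows would 
simply be skipped.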

So far, I've tried working through some Stack Overflow-based solutions, 
but most seem to use the textreadr package, which is now deprecated; 
others use either the officer or the officedown packages. However, these 
packages don't appear to do what I want the program to do, at least not 
in any of the examples I have found, nor in the vignettes and relevant 
package manuals I've looked at.

My first question: is what I intend to do even possible in R? If it is, 
where do I start? Would converting the docx files to UTF-8 plain text 
make the task easier?
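
For the plain-text route, the directory loop might look roughly like the 
base R sketch below. Everything about the file layout is an assumption: 
one article per .txt file, with the keywords on a single line starting 
"Subject:". The names keyword_coverage and collect_articles and the 
output columns are hypothetical, and the keyword is used as a regular 
expression, so it should not contain special characters.

```r
# Return the % coverage of `keyword` from the "Subject:" line,
# or NA if the line or the keyword is missing (assumed layout).
keyword_coverage <- function(lines, keyword) {
  subject <- grep("^Subject:", lines, value = TRUE)
  if (length(subject) == 0) return(NA_real_)
  pattern <- paste0(keyword, " \\(([0-9]+)%\\)")
  m <- regmatches(subject[1], regexec(pattern, subject[1]))[[1]]
  if (length(m) == 0) return(NA_real_)
  as.numeric(m[2])
}

# Loop over every .txt file in `dir` and append one row per
# qualifying article to `out_csv`.
collect_articles <- function(dir, keyword, out_csv, threshold = 50) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  for (f in files) {
    lines <- readLines(f, encoding = "UTF-8", warn = FALSE)
    pct <- keyword_coverage(lines, keyword)
    if (is.na(pct) || pct < threshold) next  # pass on to the next article
    row <- data.frame(file = basename(f), keyword = keyword,
                      coverage = pct, stringsAsFactors = FALSE)
    # Write the header only when the CSV does not exist yet.
    exists_already <- file.exists(out_csv)
    write.table(row, out_csv, sep = ",", row.names = FALSE,
                col.names = !exists_already, append = exists_already)
  }
  invisible(out_csv)
}
```

A call such as collect_articles("articles", "FAST FASHION", "out.csv") 
would then build the spreadsheet as a CSV file; the title, month, author, 
word count and page fields from (1) would need similar line-matching 
rules once the real layout is known.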

I am not a confident coder and am only just getting my head around R, so 
I appreciate there is a steep learning curve ahead. But of course, I 
don't know what I don't know, so any pointers in the right direction 
would be a big help.

Many thanks in anticipation

Andy
