List: r-help
Subject: Re: [R] Reading very large text files into R
From: "Ebert,Timothy Aaron" <tebert () ufl ! edu>
Date: 2022-09-30 23:43:01
Message-ID: BN6PR2201MB1553B86BFEE83B5EDD382CD9CF569 () BN6PR2201MB1553 ! namprd22 ! prod ! outlook ! com
Truth
Tim
-----Original Message-----
From: R-help <r-help-bounces@r-project.org> On Behalf Of Avi Gross
Sent: Friday, September 30, 2022 7:01 PM
Cc: R help Mailing list <r-help@r-project.org>
Subject: Re: [R] Reading very large text files into R
Those are valid reasons, since examining data and cleaning or fixing it is a
major step before any analysis or plotting. Indeed, an extra column caused by
something in an earlier column may have messed up all columns to the right.
My point was that replicating a problem like this may require many more lines
from the file.
On Fri, Sep 30, 2022, 5:58 PM Ebert,Timothy Aaron <tebert@ufl.edu> wrote:
> The point was more to figure out why most lines have 15 values and
> some give an error indicating that there are 16. Are there notes, or
> an extra comma? Some weather stations fail and give interesting data
> at, before, or after failure. Are the problem lines indicating machine
> failure? Typically code does not randomly enter extra data. Most
> answers appear to assume that the 16th column has been entered at the
> end of the data, but no evidence indicates this is true. If there is
> an initial value at the beginning of the row, then all of the data for
> that row will be in error if the "16" value is deleted. I am just
> paranoid enough to suggest looking at one case to make sure all is as
> assumed.
> Another way to address the problem is to test the data. Are there
> temperatures less than -100 C or greater than 60 C? Why would one ever
> get such a thing? Machine error, or a column misaligned so that
> humidity values are in the temperature column.
>
> Tim
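
The two checks suggested above (look at the offending lines, then test the values) can be sketched in base R. This is only a minimal illustration: the sample file, its three-column layout and the names station, temp_c and humidity are made up here, since the thread never shows the real data, and 3 versus 4 fields stands in for 15 versus 16.

```r
# Hypothetical sample standing in for the weather file; the real layout is
# not shown in the thread, so 3 fields stand in for the usual 15.
sample_lines <- c("station,temp_c,humidity",
                  "A,12.5,80",
                  "A,13.1,78,note",  # one extra field at the end
                  "B,85,9.7")        # temperature and humidity swapped
path <- tempfile(fileext = ".csv")
writeLines(sample_lines, path)

# count.fields() reports the number of fields on each line without
# building a data frame.
n <- count.fields(path, sep = ",")
print(table(n))

# Look at the odd lines themselves before deciding how to repair them.
bad <- which(n != 3)
print(readLines(path)[bad])

# Range test on the well-formed lines only: flag impossible temperatures.
good <- readLines(path)[n == 3]
dat <- read.csv(text = good)
suspect <- dat[dat$temp_c < -100 | dat$temp_c > 60, ]
print(suspect)
```

For a file too large to hold twice, the readLines() calls would need to be replaced by something that reads only the lines of interest; they are used here to keep the sketch short.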
>
> -----Original Message-----
> From: R-help <r-help-bounces@r-project.org> On Behalf Of
> avi.e.gross@gmail.com
> Sent: Friday, September 30, 2022 3:16 PM
> Cc: r-help@r-project.org
> Subject: Re: [R] Reading very large text files into R
>
> Tim and others,
>
> A point to consider is that there are various algorithms in the
> functions used to read in formatted data into data.frame form and they
> vary. Some do a look-ahead of some size to determine things and if
> they find a column that LOOKS LIKE all integers for say the first
> thousand lines, they go and read in that column as integer. If the
> first floating point value is thousands of lines further along, things
> may go wrong.
> So asking for line/row 16 to have an extra 16th entry/column may work
> fine for an algorithm that looks ahead and concludes there are 16
> columns throughout. Yet a file where the first time a sixteenth entry
> is seen is at line/row 31,459 may well just set the algorithm to
> expect exactly 15 columns and then be surprised as noted above.
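
For base R specifically, the documentation of read.table() says the number of columns is taken from the first five lines of input unless col.names is longer. A small sketch of the effect described above, with made-up data in which 3 versus 4 fields stands in for 15 versus 16; the column names V1 to V4 are placeholders, not anything from the original file:

```r
# Ten lines with 3 fields, then one with 4.
txt <- c(rep("1,2,3", 10), "4,5,6,7")

# With the defaults, the table is sized from the first five lines and the
# longer line stops the read. The error message is captured, not raised.
res <- tryCatch(read.table(text = txt, sep = ","),
                error = function(e) conditionMessage(e))
print(res)

# Supplying enough col.names together with fill = TRUE tells read.table()
# how wide the widest row is, so shorter rows are padded with NA.
dat <- read.table(text = txt, sep = ",", fill = TRUE,
                  col.names = paste0("V", 1:4))
print(tail(dat, 3))
```

This pads on the right, so it is only appropriate once it has been confirmed that the extra value really is at the end of the row.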
>
> I have stayed out of this discussion and others have supplied pretty
> much what I would have said. I also see the data as flawed and ask
> which rows are the valid ones. If a sixteenth column is allowed, it
> would be better if all other rows had an empty sixteenth column. If
> not allowed, none should have it.
>
> The approach I might take, again as others have noted, is to
> preprocess the data file using some form of stream editor such as AWK
> that automagically reads in a line at a time and parses lines into a
> collection of tokens based on what separates them such as a comma. You
> can then either write out just the first 15 to the output stream if
> your choice is to ignore a spurious sixteenth, or write out all
> sixteen for every line, with the last being some form of null most of
> the time. And, of course, to be more general, you could make two
> passes through the file with the first one determining the maximum
> number of entries as well as what the most common number of entries
> is, and a second pass using that info to normalize the file the way
> you want. And note some of what was mentioned could often be done in
> this preprocessing such as removing any columns you do not want to
> read into R later. Do note such filters may need to handle edge cases
> like skipping comment lines or treating the row of headers differently.
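
The two-pass idea can be sketched in R rather than AWK. The function normalize_fields() below is a hypothetical helper written for this illustration, not something posted in the thread. It assumes a plain comma-separated file with no quoted commas, no comment lines and no blank lines, and it truncates or pads on the right, which is exactly the assumption that the earlier message warns should be checked first.

```r
# Hypothetical helper, not from the thread: rewrite a delimited file so
# that every line has the most common number of fields.
normalize_fields <- function(infile, outfile, sep = ",", chunk = 10000L) {
  # Pass 1: tabulate the field counts to find the most common width.
  n <- count.fields(infile, sep = sep, quote = "", comment.char = "")
  target <- as.integer(names(which.max(table(n))))

  # Pass 2: stream the file in chunks, fixing each line in turn.
  con_in <- file(infile, open = "r")
  on.exit(close(con_in))
  con_out <- file(outfile, open = "w")
  on.exit(close(con_out), add = TRUE)
  repeat {
    lines <- readLines(con_in, n = chunk)
    if (length(lines) == 0L) break
    parts <- strsplit(lines, sep, fixed = TRUE)
    fixed <- vapply(parts, function(p) {
      length(p) <- target   # drops extra fields, pads short rows with NA
      p[is.na(p)] <- ""     # write the padding as empty fields
      paste(p, collapse = sep)
    }, character(1))
    writeLines(fixed, con_out)
  }
  invisible(c(most_common = target, max = max(n)))
}

# Usage on a small made-up file.
src <- tempfile()
dst <- tempfile()
writeLines(c("a,b,c", "1,2,3", "4,5,6,7", "8,9"), src)
info <- normalize_fields(src, dst)
print(readLines(dst))
print(info)
```

Writing to a second file rather than editing in place keeps the original available for checking what was dropped.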
> As some have shown, you can create your own filters within a language
> like R too and either read in lines and pre-process them as discussed
> or continue on to making your own data.frame and skip the read.table()
> type of functionality. For very large files, though, having multiple
> variations in memory at once may be an issue, especially if they are
> not removed and further processing and analysis continues.
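
A rough sketch of building the data.frame by hand and then releasing the intermediate copies, along the lines described above. The file contents and the column names station and temp_c are invented for the example, and the approach assumes simple comma-separated lines without quoting.

```r
# Made-up input with one over-long row.
path <- tempfile()
writeLines(c("station,temp_c", "A,12.5", "B,13.0,extra", "C,9.5"), path)

raw_lines <- readLines(path)
header <- strsplit(raw_lines[1], ",", fixed = TRUE)[[1]]
parts <- strsplit(raw_lines[-1], ",", fixed = TRUE)

# Keep only as many fields per row as there are header names.
mat <- t(vapply(parts, function(p) p[seq_along(header)],
                character(length(header))))
dat <- as.data.frame(mat, stringsAsFactors = FALSE)
names(dat) <- header

# Convert each column from text to its natural type.
dat[] <- lapply(dat, type.convert, as.is = TRUE)

# Drop the intermediate copies so that only the final object stays in
# memory, then ask R to release the space.
rm(raw_lines, parts, mat)
invisible(gc())

str(dat)
```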
>
> Perhaps it might be sensible to contact those maintaining the data and
> point out the anomaly and ask if their files might be saved
> alternately in a format that can be used without anomalies.
>
> Avi
>
> -----Original Message-----
> From: R-help <r-help-bounces@r-project.org> On Behalf Of Ebert,Timothy
> Aaron
> Sent: Friday, September 30, 2022 7:27 AM
> To: Richard O'Keefe <raoknz@gmail.com>; Nick Wray
> <nickmwray@gmail.com>
> Cc: r-help@r-project.org
> Subject: Re: [R] Reading very large text files into R
>
> Hi Nick,
> Can you post one line of data with 15 entries followed by the next
> line of data with 16 entries?
>
> Tim
>
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.