
List:       sas-l
Subject:    Useful regular expressions compression, email addresses, SSN, form replacements and SSN
From:       Roger DeAngelis <rogerjdeangelis () GMAIL ! COM>
Date:       2016-08-30 20:28:00
Message-ID: 2211739815061032.WA.rogerjdeangelisgmail.com () listserv ! uga ! edu

/* T0099740 Stackoverflow SAS: Useful regular expressions compression, email addresses, SSN, form replacements and SSN

http://stackoverflow.com/questions/39076568/remove-certain-part-of-expression

Some Useful regular expressions

1.  REMOVE (...) - Parentheses and everything between
2.  Social Security number
    http://regexlib.com/Search.aspx?k=ssn&AspxAutoDetectCookieSupport=1
3.  Does the list contain valid  email addresses
4.  Form character substitution '0A'x to \n

STRING COMPRESSION

HAVE

VAR=a b(ref='aaa') c d(ref='zzz')

%let var=a b(ref='aaa') c d(ref='zzz');
%put &=var;

WANT TO REMOVE (..) TO GET

VAR=a b c d


SOLUTION

%let var=a b(ref='aaa') c d(ref='zzz');

data _null_;
  var=prxchange("s/\([^)]*\)//",-1,"&var");
  put var=;
run;

*  simple

  1. \(      -> escaped '(' matches a literal open parenthesis
  2. [^)]    -> matches any single character except ')'
  3. *       -> repeats the preceding [^)] zero or more times
  4. \)      -> escaped ')' matches the literal close parenthesis
  5. //      -> empty replacement, so each (...) is removed
  6. -1      -> prxchange applies the substitution to every match

VAR=a b c d

[^)]  Matches a single character that is not contained within the brackets.
      For example, [^abc] matches any character other than "a", "b", or "c".

*  Matches the preceding element zero or more times.
   For example, ab*c matches "ac", "abc", "abbbc", etc.
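
The same substitution can be applied to a data set variable instead of a
macro variable. A minimal sketch, assuming a hypothetical data set HAVE with a
character variable MODEL (both names are made up for illustration); COMPBL is
added only to squeeze any doubled blanks left behind:

```sas
/* hypothetical sample data, not from the original question */
data have;
  length model $60;
  model="age sex(ref='F') bmi race(ref='White')"; output;
  model="x1 x2(ref='0')";                         output;
run;

data want;
  set have;
  length terms $60;
  /* remove every "( ... )" group, then collapse repeated blanks */
  terms = compbl(prxchange('s/\([^)]*\)//', -1, model));
  put model= / terms=;
run;
```

Note that [^)]* stops at the first ')', so nested parentheses would need a
different pattern.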


VALIDATING SSNs

/* T000850 Regular expressions US social security numbers

http://regexlib.com/Search.aspx?k=ssn

^\d{3}-\d{2}-\d{4}$
This regular expression will match a hyphen-separated Social Security Number (SSN) in
the format NNN-NN-NNNN.

^(?!000)(?!666)(?<SSN3>[0-6]\d{2}|7(?:[0-6]\d|7[012]))([- ]?)(?!00)(?<SSN2>\d\d)\1(?!0000)(?<SSN4>\d{4})$

Updated on 3/4/2004 per feedback to additionally exclude SSNs that begin with 666
which, as reported, are also not valid. Regular expression for validating US Social
Security Numbers. Accepts optional hyphens or spaces as formatting characters. Parses
the three subfields of the SSN into three named sub-strings (SSN1, SSN2, and SSN3) to
facilitate program use. Rejects matches on all zeros for any individual subfield of
the Social Security Number. Matches only on those SSNs that fall within the range of
numbers currently allocated by the Social Security Administration.

^(?!000)([0-6]\d{2}|7([0-6]\d|7[012])) ([ -])? (?!00)\d\d([ -|])? (?!0000)\d{4}$

U.S. social security numbers (SSN), within the range of numbers that have been
currently allocated. Matches the pattern AAA-GG-SSSS, AAA GG SSSS, AAA-GG SSSS, AAA
GG-SSSS, AAAGGSSSS, AAA-GGSSSS, AAAGG-SSSS, AAAGG SSSS or AAA GGSSSS. All zero in
any one field is not allowed. ** Additionally, spaces and/or dashes and/or nothing
are allowed. In Michael Ash's example 123-45 6789 and 12 3456789 would fail; there was
a '\3' after the second octet of numbers that seemed to confuse the regex. Now any
combination of spaces, dashes, or nothing will work between the SSN octets.
BoxerX.com thanks Michael for the regex!

(^(?!000)\d{3}) ([- ]?) ((?!00)\d{2}) ([- ]?) ((?!0000)\d{4})

This regular expression is used to validate the US SSN.
It won't allow non-digit characters or all zeros in any field.


^(?!000)(?!666)(?!9)\d{3}([- ]?)(?!00)\d{2}\1(?!0000)\d{4}$

Variant of the named-group expression above, without the named sub-strings.
Accepts an optional hyphen or space as the separator (the same one in both
positions, via \1), rejects all zeros in any individual subfield, and rejects
area numbers 666 and 900-999.

^((?!000)(?!666)(?:[0-6]\d{2}|7[0-2][0-9]|73[0-3]|7[5-6][0-9]|77[0-2]))-((?!00)\d{2})-((?!0000)\d{4})$


Could not find a regex that truly matched the rules here
http://en.wikipedia.org/wiki/Social_Security_number#Valid_SSNs
So I modified an existing one to match the valid SSN rules.
The first digit set will not match: 000, 666, 734 to 749,
and greater than 772. Also rejected: numbers with all zeros in any digit
group (000-xx-####, ###-00-####, ###-xx-0000)

\b(?!000)(?!666)(?!9)[0-9]{3}[ -]?(?!00)[0-9]{2}[ -]?(?!0000)[0-9]{4}\b

Finds 9 digit numbers within word boundaries, not separated or separated
by - or space, not starting with 000, 666, or 900-999, not containing 00 or
0000 in the middle or at the end of SSN (in compliance with current SSN rules).
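
A minimal sketch of how one of these expressions might be used from SAS,
assuming the pattern with the \1 back-reference listed above. The test values
are made up for illustration; STRIP is used so that trailing blanks do not
defeat the $ anchor.

```sas
data _null_;
  input ssn $char11.;
  /* \1 forces the second separator to equal the first one */
  ok = prxmatch(
         '/^(?!000)(?!666)(?!9)\d{3}([- ]?)(?!00)\d{2}\1(?!0000)\d{4}$/',
         strip(ssn)) > 0;
  put ssn= ok=;
  datalines;
123-45-6789
123 45 6789
123456789
123-45 6789
000-45-6789
666-45-6789
123-00-6789
900-45-6789
;
run;
```

The first three values should give ok=1. The remaining values should give
ok=0: mixed separators, area 000, area 666, group 00, and an area number
starting with 9.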



MATCHING VALID EMAIL ADDRESSES;

http://goo.gl/Waol8M
https://dzone.com/articles/regular-expression-to-validate-a-comma-separated-l?edition=206671&utm_source=Daily%20Digest&utm_medium=email&utm_campaign=dd%202016-08-27


Recently I needed to create a regular expression to validate the format of a
comma-separated list of email addresses. Just thought I'd share the result in case it
is of use to anyone:

\w+@\w+\.\w+(,\s*\w+@\w+\.\w+)*
Here's an example of applying the pattern in Java:

// Compile pattern
Pattern emailAddressPattern = Pattern.compile(String.format("%1$s(,\\s*%1$s)*", "\\w+@\\w+\\.\\w+"));
// Validate addresses
System.out.println(emailAddressPattern.matcher("xyz").matches()); // false
System.out.println(emailAddressPattern.matcher("foo@bar.com").matches()); // true
System.out.println(emailAddressPattern.matcher("foo@bar.com, xyz").matches()); // false
System.out.println(emailAddressPattern.matcher("foo@bar.com, foo@bar.com").matches()); // true


data _null_;

  * rc is the position where the first match starts (0 means no match);
  * good addresses: the match starts at position 1, so rc=1;
  str='rogerjdeangelis@gmail.com,foo@mac.com,achme@solar.com';
  rc=prxmatch('/\w+@\w+\.\w+(,\s*\w+@\w+\.\w+)*/',str);
  put rc=;

  * first entry is bad: the match starts at the second entry, so rc=10;
  * the pattern is not anchored, so rc=1 alone does not prove every entry is valid;
  str='mary.com,foo@mac.com';
  rc=prxmatch('/\w+@\w+\.\w+(,\s*\w+@\w+\.\w+)*/',str);
  put rc=;

  * if rc=0 no valid addresses;
  str='mary.com,foo.com';
  rc=prxmatch('/\w+@\w+\.\w+(,\s*\w+@\w+\.\w+)*/',str);
  put rc=;

run;quit;
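
Unlike Java's matches(), PRXMATCH searches for the pattern anywhere in the
string. A minimal sketch of validating the whole list, assuming anchors are
added to the same pattern: ^ pins the start, and \s*$ pins the end while
tolerating the trailing blanks SAS pads onto character variables. The test
strings are illustrative.

```sas
data _null_;
  length str $80;
  do str = 'rogerjdeangelis@gmail.com,foo@mac.com,achme@solar.com',
           'foo@mac.com,mary.com',
           'mary.com,foo@mac.com';
    /* 1 only when every comma-separated entry matches \w+@\w+\.\w+ */
    all_valid = prxmatch('/^\w+@\w+\.\w+(,\s*\w+@\w+\.\w+)*\s*$/', str) > 0;
    put str= all_valid=;
  end;
run;
```

Only the first list should give all_valid=1. The second list starts with a
valid address but still fails, which the unanchored pattern would not detect.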

/* T000250 IS THE EMAIL ADDRESS VALID

ANOTHER EMAIL ADDRESS CHECKER

%let regex='/^([a-zA-Z0-9_\+\-\.]+)@([a-zA-Z0-9_\+\-\.]+)\.([a-zA-Z]{2,5})$/';
data _null_;
  if prxmatch(&regex.,trim('regusers@^%&$-@@ACHME.com'))>0 then put 'EMAIL ADDRESS IS OK';
   else put "Not Ok";
  if prxmatch(&regex.,trim('mary@foo.com'))>0 then put 'email address is ok';
   else put "Not Ok";
run;

/* T000850 HOW MANY TIMES AAA  APPEARS IN 'AAACCCAAACCCAAACCCAAACCCAAAAAAAAACCC'
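   Solution
   Counts only runs of exactly three a's; the long run aaaaaaaaa is not counted
   ([^a]|^) - preceded by a non-a character or the start of the string
   a{3}     - exactly three a's
   ([^a]|$) - followed by a non-a character or the end of the string
   each match is replaced with * and countc then counts the *'s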
   data _null_ ;
      a='aaacccaaacccaaacccaaacccaaaaaaaaaccc';
      _3a=countc(prxchange('s/([^a]|^)a{3}([^a]|$)/*/',-1,a),'*') ;
      put _3a= ;
   run;
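
If every nonoverlapping aaa should be counted instead, including those inside
the long run, one alternative is to loop with CALL PRXNEXT. This is a sketch
of a different technique, not the approach above; the variable names pid, pos,
len and n are arbitrary.

```sas
data _null_;
  a     = 'aaacccaaacccaaacccaaacccaaaaaaaaaccc';
  pid   = prxparse('/aaa/');      /* compile the pattern once        */
  start = 1;                      /* search window: whole string     */
  stop  = length(a);
  n     = 0;
  call prxnext(pid, start, stop, a, pos, len);
  do while (pos > 0);             /* pos=0 when no further match     */
    n = n + 1;
    call prxnext(pid, start, stop, a, pos, len);
  end;
  put n=;
run;
```

Each call resumes after the previous match, so the run of nine a's should
contribute three matches here rather than none.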


/* T003130 PERL SCAN IT FOR CHARACTERS SUCH AS NEWLINE, TAB, FORM FEED AND REPLACE WITH TYPICAL FORMS

A coworker just presented me with this task.  I came up with two
solutions, but I don't like either of them.  He has a text document
and wants to scan it for characters such as newline, tab, form feed,
carriage return, vertical tab.  If found, he wants to replace them
with their typical representation (ie, \n, \t, \f, \r).

\r is the classic Mac OS end-of-line (EOL) character

 data _null_ ;
    a=cats('09'x,'0A'x,'0C'x);
    * each comment statement needs its own semicolon, or it swallows the next statement;
    a=prxchange('s/\n/\\n/',-1,a);   * 0A;
    a=prxchange('s/\t/\\t/',-1,a);   * 09;
    a=prxchange('s/\f/\\f/',-1,a);   * 0C;
    put a=;
 run;
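
The task also mentions carriage return and vertical tab. A minimal sketch
extending the same approach, assuming \v is an acceptable printed form for
vertical tab (the original lists only \n, \t, \f, \r) and that the Perl hex
escape \x0B is used to match it. The letters A-F are added only to make the
result easier to read.

```sas
data _null_;
  length a $40;
  a = cat('A', '09'x, 'B', '0A'x, 'C', '0C'x, 'D', '0D'x, 'E', '0B'x, 'F');
  a = prxchange('s/\t/\\t/',   -1, a);   * 09 tab;
  a = prxchange('s/\n/\\n/',   -1, a);   * 0A newline;
  a = prxchange('s/\f/\\f/',   -1, a);   * 0C form feed;
  a = prxchange('s/\r/\\r/',   -1, a);   * 0D carriage return;
  a = prxchange('s/\x0B/\\v/', -1, a);   * 0B vertical tab, \v is an assumed label;
  put a=;
run;
```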

/* T003780 REMOVING ADJACENT REPEATING CHARACTERS FROM A STRING  */

    You can use the substitution operator to find pairs of characters (or
    runs of characters) and replace them with a single instance. In this
    substitution, we find a character in "(.)". The memory parentheses store
    the matched character in the back-reference "\1" and we use that to
    require that the same thing immediately follow it. We replace that part
    of the string with the character in $1.

            s/(.)\1/$1/g;  or

            s/(.)\1*/$1/;


     /* The (.) finds a character and stores it in $1, the \1* matches any
        immediate repeats of that character, and the $1 then substitutes the
        single character for the whole run. */

     data onetym;
        str='aaabbcdddzzdddddddeeff';
        str=prxchange('s/(.)\1*/$1/',-1,str);
        put str=;
     run;

     STR=abcdzdef
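
The two substitutions quoted above behave differently on runs longer than two
characters. A small sketch comparing them on the same string (the variable
names pairs and runs are arbitrary):

```sas
data _null_;
  str   = 'aaabbcdddzzdddddddeeff';
  /* (.)\1  replaces each adjacent pair with one character */
  pairs = prxchange('s/(.)\1/$1/',  -1, str);
  /* (.)\1* replaces each whole run with one character     */
  runs  = prxchange('s/(.)\1*/$1/', -1, str);
  put pairs= / runs=;
run;
```

With the pair form, a run of three or more is shortened but not fully
collapsed, because each match consumes only two characters.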

