html - Regex to extract all Starred Items URLs from Google Reader JSON file -


sadly announced google reader shutdown mid of year. since have large amount of starred items in google reader i'd them up. possible via google reader takeout. produces file in json format.

now extract of article urls out of several mb large file.

at first thought best use regex url seems better extract needed article urls regex find article urls. prevent extract other urls not needed.

here short example how parts of json file looks:

"published" : 1359723602, "updated" : 1359723602, "canonical" : [ {   "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/" } ], "alternate" : [ {   "href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/ephjmt-xtn4/",   "type" : "text/html" } ], 

i need urls can find here:

 "canonical" : [ {   "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/" } ], 

perhaps in mood how regex have extract these urls?

the benefit have quick , dirty way extract starred items urls google reader import them in services pocket or evernote, once processed.

i know asked regex, think there's better way handle problem. multi-line regular expressions pita, , in case there's no need kind of brain damage.

i start grep, rather regex. -a1 parameter says "return line matches, , 1 after":

grep -a1 "canonical" <file> 

this return lines this:

"canonical" : [ {     "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/" 

then, i'd grep again href:

grep -a1 "canonical" <file> | grep "href" 

giving

"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/" 

now can use awk url:

grep -a1 "canonical" <file> | grep "href" | awk -f'" : "' '{ print $2 }'  

which strips out first quote on url:

http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/" 

now need rid of quote:

grep -a1 "canonical" <file> | grep "href" | awk -f'" : "' '{ print $2 }' | tr -d '"' 

that's it!

http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/ 

Comments

Popular posts from this blog

monitor web browser programmatically in Android? -

Shrink a YouTube video to responsive width -

wpf - PdfWriter.GetInstance throws System.NullReferenceException -