Public pad GLAMcampNYC-ut, latest text saved May 25, 2011

 
DATA INGESTION TOOL
Work from GLAMCamp NYC, May 20-22 2011
 
Summary
 
There is a Python library (pywikipediabot[0]) from which many of Maarten's bots are derived[1]. The desired outcome of this session is to turn it into a library that functions as a black box for uploading data to Wikimedia Commons. This library will have one external function, put(metadata, configuration). It also needs to include a function to check for duplicates.
 
Configuration will be a dictionary that holds the following keys:
- configurationTemplate, holds URL to configuration template
- configurationTitleTemplate, holds configuration of Title Template
- sourceKey, holds the key of the metadata dict that indicates the URL of the source
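
A minimal sketch of what such a configuration dictionary might look like, and how sourceKey could point at the source URL inside one item's metadata. The three key names come from the list above; the example values, the metadata field "image_url" and the helper names check_configuration and source_url are assumptions made up for illustration.

```python
# Keys named in the notes; the values further down are made-up placeholders.
REQUIRED_KEYS = ("configurationTemplate", "configurationTitleTemplate", "sourceKey")


def check_configuration(configuration):
    """Raise ValueError if any key listed in the notes is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in configuration]
    if missing:
        raise ValueError("configuration is missing: " + ", ".join(missing))
    return configuration


def source_url(metadata, configuration):
    """Look up the source URL via the metadata key that sourceKey names."""
    return metadata[configuration["sourceKey"]]


configuration = check_configuration({
    "configurationTemplate": "https://example.org/wiki/Template:Example_ingestion",
    "configurationTitleTemplate": "%(title)s (%(id)s).jpg",
    "sourceKey": "image_url",
})
metadata = {"title": "Example", "id": "42", "image_url": "https://example.org/42.jpg"}
print(source_url(metadata, configuration))
```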
 
An extra module will be written to ingest different formats and offer their metadata as a dictionary in key-value format (the metadata argument of put()). This module can be given a GUI. It will be written as a base class which can be subclassed or extended, to enable different standards.
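
One way the base class idea could look, as a rough sketch only. The class names MetadataReader and CsvMetadataReader and the method name records are hypothetical; the notes do not fix any of them. The CSV subclass uses Python's standard csv module.

```python
import csv
import io


class MetadataReader:
    """Hypothetical base class: a subclass turns one input format into dicts."""

    def records(self):
        """Yield one key-value dict per item; subclasses must override this."""
        raise NotImplementedError


class CsvMetadataReader(MetadataReader):
    """Reads CSV text whose first row holds the field names."""

    def __init__(self, text):
        self.text = text

    def records(self):
        for row in csv.DictReader(io.StringIO(self.text)):
            yield dict(row)


sample = "title,creator\nExample painting,Example Artist\n"
for record in CsvMetadataReader(sample).records():
    print(record)
```

A reader for another format (OAI-PMH, say) would subclass MetadataReader the same way and yield the same kind of dict.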
 
 
 
Discussion
 
[there was an email on commons-l about mass upload - he scraped a bunch of images off federal websites
 
 
Jeremy: got the guy who wrote the mail on Skype -- pyrak?
 
Maarten says: I did that for many federal ]
 
 
 
We have a lot of custom tools to do mass uploads
I wrote several (Maarten)
 
Some other people have also written some
they 
 
 
Pywikipedia - framework for bots
 
that is the basis.
So if Maarten does the upload it's just 2 lines of code.
 
Maarten just builds on top of it, glues libraries together.
Current bots he has are - all work in the same order -- get metadata, check for duplicates, etc.
summed it up at 
 
 
but they are still all custom.  check SVN log - I take latest, mod it, & have another.
 
So I'd like to make a basic bot that includes some logic common to all of them, then we can extend it for future projects
 
then that is a basis for a real generic one 
 
this weekend we can totally hack up a basic generic one! in Python
    most code is already there, just need to glue it together as a puzzle
    extension.... use framework to do an upload with the data
 
Is this the more structural way to go about this? 
 
Kaldari: leaves out 1/2 the equation. How do we get the files uploaded in the first place to get them on commons with the bot? we need a staging system.
 
pywikipediabot can grab a URL, local or remote.
 
what if it is not on the web?
 
should we or should we not cover the use case of "this database is not on the web?"
 
what about a temp staging area? on toolserver?
 
right now we're getting stuck.... institutions to us? not in scope for this weekend?
 
Maarten wants to make small steps, since he does not expect WMF support
 
extending pywikipediabot -- not everyone here codes in Python.  And getting an architecture written would also be a good use of the weekend.
 
getting metadata....
sometimes in a csv, sometimes in a db, sometimes in a repo accessible via API.
Maarten takes original metadata, transforms, puts it on Commons.
 
Kaldari: yes, that is the most important thing that blocks the whole process.  We could ID diff tasks that diff people could work on.
 
Maarten: yes, metadata is most important one - Europeana as example
 
MaartenZ: outsider's perspective: helped Maarten a bit with this part.  Maybe WM can ask GLAM for a certain format.  "we only accept foo with this XML structure"
 
Kaldari: yes, vital! assemble a task team to do that
 
and someone from GLAM or Wikipedia can convert stuff.  We can make converters, APIs.
 
This decision.... a big one!  natively accept certain common formats, or just 1?  should we be making this now?
 
Maarten D doesn't want to just talk this weekend & make fake "architecture" doc & have months go by & do nothing!
 
what about web-based?  user-friendly, etc.
 
we have to learn a lot before we can get there.  we cannot specify a final product because we don't know what we want.
 
fast-prototype, be more agile.  refactor the custom bots & make them more generic.  make a timeline.  We're in the custom area, make more generic, at some point we want to make it more web-based.
 
MaartenZ: as a software developer, on the whiteboard -- what are inputs & outputs?
 
We did this during amsterdam hackathon
 
METADATA IN  ->   [blackbox] ----> files in commons
 
part of metadata might be a URL
quicktime video? no, QT is not allowed.  Ogg?
this project has nothing to do with uploading???!!!
 
no, more about handling metadata.  No, this is a metadata munging thing?
 
Jarek - I am working on uploading 22,000 images from Web Gallery of Art. I downloaded all imgs.  I have a spreadsheet - metadata in a Commons-acceptable format.  templates for dates, technique, medium, etc.  Now I have a spreadsheet and a bunch of files on my computer. did not get it to work.  Learning Python.
 
 
MaZ: standards of metadata.... technical standards....
we can only influence in the longterm
 
 
 
 
 
(missed a little)
 
people started using standard formats
squeeze it into something else?  get garbage
 
 
there are diff WC templates -- each file has 1 template
several specialized formats - info for a photo, one is artwork for GLAM, also book templates.
 
 
template.... decides how metadata will be handled.... assume key-value pairs? Dublin Core, might have multiple entries with same key
 
 
Kaldari: we should probably - with a bot, you have a config file that maps to custom template for this collection. specify on the command line.
MaZ: or a static variable someplace
 
Jarek: headings somewhere....matches stuff in template... get columns in the CSV, put the right headings
 
MaD: yes, that's what I'm doing. if data is perfect, don't need to munge it, that's great.  if you need to massage, you need to do in code
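
The column-heading idea can be sketched as a small mapping from CSV headings to template parameters. The function name render_template, the headings "Titel" and "Maker", and the parameter names are assumptions for illustration; a real collection would use whatever its chosen Commons template expects.

```python
def render_template(template_name, item, column_to_parameter):
    """Build wikitext for one item, renaming columns to template parameters."""
    lines = ["{{" + template_name]
    for column, parameter in column_to_parameter.items():
        value = item.get(column, "")
        lines.append("|%s=%s" % (parameter, value))
    lines.append("}}")
    return "\n".join(lines)


# Hypothetical mapping: CSV heading -> template parameter.
mapping = {"Titel": "title", "Maker": "artist"}
item = {"Titel": "Example painting", "Maker": "Example Artist"}
print(render_template("Artwork", item, mapping))
```

If the data is clean, the mapping is all that is needed; any massaging of values would happen in code before this step.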
 
Sumana: try out Google Refine.
 
Maarten: define transformation, like in date.... we can tell the bot to do that... in short term, we do not control metadata standard.  in longrun, we can standardize on adlib???? output standard -- adlib
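
A date transformation of the kind mentioned here might look like the sketch below. The list of input layouts and the choice of YYYY-MM-DD as the output are assumptions; anything that cannot be parsed is passed through unchanged so a human can look at it later.

```python
from datetime import datetime

# Input layouts to try, in order; extend this per collection.
KNOWN_FORMATS = ("%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d")


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or the input unchanged if unparseable."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw


print(normalize_date("21-05-2011"))
print(normalize_date("circa 1650"))
```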
 
We'll call it DATA INGESTION (;-) now. (as data ingestion covers metadata and objects)
 
 
Jarek: challenge of data ingestion: easier to decide on categories after upload than to think commons community will hop on your new 100k images & properly categorize each
so, if a tool could try to use controlled vocab to categorize....
 
templates..... [inaudible]
 
names of categories on commons are unpredictable
 
MaartenD: you can leave it for humans to do (small collections), or fully automate, as for geograph? or temp categories - make assumptions, have users move them to real ones after the temp/fake ones.  Sometimes a combo. users, bots, etc. combo works best.
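
The "combo" approach could be sketched like this: use a known category where a controlled vocabulary gives one, and fall back to a temporary category for users to sort out later. The function name choose_categories, the vocabulary and the temporary prefix are all made up for illustration.

```python
def choose_categories(subjects, known_categories, temp_prefix):
    """Map subjects to known categories, else to a temporary category."""
    categories = []
    for subject in subjects:
        if subject in known_categories:
            categories.append(known_categories[subject])
        else:
            # No match in the vocabulary: park it in a temporary category.
            categories.append("%s %s" % (temp_prefix, subject))
    return categories


known = {"bridge": "Bridges"}
print(choose_categories(["bridge", "lighthouse"], known, "Example collection -"))
```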
 
Jarek: foodfight over Bundesarchiv - temp categories -- people got used to the temp categories & cat structure disappeared overnight
 
MaartenD: my fault! re templates
 
 
What to do in next 50 min?
 
[discussing creating templates]
 
Jarek: translation layer? translate data that comes from... for each image, ... give fields.... map it to fields in a standard, official template that's already there.
 
Dave: subject headings?  we have many images...not all the same topic.  Create a list of what subjects we already have?
 
Dutch - English localization, bleh, assigned temp categories, people could match those with the canonical categories, M made a list of most-used.
    people are still sorting that out more than a year later, 60K
 
MaartenZ: this system of making a pywikipedia bot that has a config file that says "this is the template you are going to use" based on an xml format....
 
If we make a framework, we can choose ingestion formats.  csv is best.  8 fields, simple, those are good. So let's define ingestion formats, a way to get the data in.
 
OAI-PMH and CSVs
 
will it be possible to update data after setting it once?
 
in another bot, we have standard functions that we will copy into this new bot, like to download a photo when you have the URL
to check for duplicates
to generate a description
 
 
"another bot" means what? where do i get that code?
 
Kaldari: what are the components we can break this into for people to work on?
 
 
how to handle things with templates, prob with multiple keys
 
MaZ: use an internal, basic metadata format... this is the URL, this is the title, etc.
(base it on dublin core?)
oEmbed has some loose standard for looking up some info about an asset
 
[diagram]
 
so, 1 component: convert from external to internal standard.
then, component 2, based on MaD's work
then, component 3, a config file, standard parameters: these are the categories it should be in
 
 
template in Maarten's userspace.....
 
 
a mapping....
 
"multiple keys"
a file with multiple authors.... the "author" field, you have it multiple times
unique subject means that Europeana items with mult subjects, we throw away all but the last
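
The "multiple keys" problem shows up as soon as repeated fields are put into a plain dict: each new value overwrites the previous one. A small sketch of one way around it, storing a list per key (the function name collect_pairs and the sample values are made up):

```python
def collect_pairs(pairs):
    """Group (key, value) pairs so repeated keys keep every value."""
    item = {}
    for key, value in pairs:
        item.setdefault(key, []).append(value)
    return item


pairs = [("author", "A. Example"), ("author", "B. Example"), ("title", "Untitled")]
print(dict(pairs))           # a plain dict keeps only the last author
print(collect_pairs(pairs))  # lists keep both authors
```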
 
table.... link to keys in the table? [inaud]
 
 
gui can be built upon it in the end, some web interface
 
Jarek was hoping: if a possible format is a Google spreadsheet, it can be worked on by community, not just 1 person
is Google spreadsheet the right place? they have a good API
Sumana: export to csv
MaD: if we structure this in such a way to get multiple input formats....
 
 
Danny: a converter....
 
we should provide a well-documented format we.... we should create a metadata format
 
key-value pairs & a few fields we really need: title, etc.
 
Dave: and license type.
 
 
MaZ: what about simple Dublin Core as a basic internal metadata format?
As the basis for naming the keys of our data ingestion
 
[naming things in such a way as to communicate easily, name them the same way in diff places]
 
Jarek: word template used in mult meanings
 
template as a substitute.... when you see final page, you don't see final format...
 
MaD: so we just need key value pairs.... and title
 
SimpleDublinCore: 15 basic fields, you can extend
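
For reference, the 15 Simple Dublin Core element names, plus a rough sketch of how an item's keys could be split into core fields and collection-specific extras. The function name split_by_dublin_core and the sample item are assumptions; nothing here was decided in the session.

```python
# The 15 elements of Simple Dublin Core.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def split_by_dublin_core(item):
    """Separate keys that are Dublin Core elements from all other keys."""
    core = {k: v for k, v in item.items() if k in DC_ELEMENTS}
    extra = {k: v for k, v in item.items() if k not in DC_ELEMENTS}
    return core, extra


core, extra = split_by_dublin_core(
    {"title": "Untitled", "rights": "CC-BY-SA", "inventory_number": "X1"}
)
print(core)
print(extra)
```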
 
Kaldari: what we're talking about makes sense as a longer term project.  May not make sense as a bot framework.  They have a standard based on this config template.  it gets fed in key-value pairs, 2-dimensions. We can do multidimensional data storage, which we can use dublin core????
 
 
[debate about dublin core, standard keys]
 
Kaldari: parameter names already set up in template, just for now.  Eventually, make something so they can go in & map their data in a hierarchical way, not just munging CSVs
 
 
MaD: confirm....
 
Danny: standards we are talking about: are we focusing on WCommons or Wikimedia projects, or trying to create a tool that would be usable on any single MW installation?
 
MaD: focus on Commons.
 
Jarek: someone else can grab 
 
Danny: I would prefer a blackbox doing all the core work, the WC output plugin, anyone else can use core tool, write other plugin to his own storage.  Wikia for example.
 
question about scope.
 
Right now, we are not going to do plugins for other repositories.
 
pywikipedia bot supports about 100 diff platforms, including Wikia
"200 families"???
 
MaD: out of scope for this project, but someone else could do it.
 
Sumana: from "you need to be a software developer to mod this" to "admin can change a config file" to "user can click around in a gui"
 
MaD: We are going from custom to more generic.
then we will take the next step to make it even easier.
expand the group of possible users/uploaders
make the config as easily accessible as possible, so people can see what you're doing & improve it
 
Jarek: good to allow more than 1 person to work on the metadata file
 
(more code in that tree)
 
Dave wants to look at a big collection of metadata.  MaD suggests he look at ....
 
 
MaZ: if we are thinking in modules, then the big black box in the middle
.... dictionary modules?
 
put dict with 2 variables.  dictionary & configuration.....
 
associative array.... call 1 "metadata" & one "configuration"
 
Description of work
 
We are going to make 3 modules
 
1. Upload module
MaartenD is going to make a function 
put(metadata, configuration);
with metadata as a dict (python for associative array)
add duplicate checker
 
2. Conversion module / interface module
Make a metadata conversion module that ingests CSVs/OAI-PMH and converts to internal dict format. Add a GUI that creates the 2 dicts necessary (metadata, configuration)
 
3. Develop a configuration standard as an array of keys for a dict.
 
draft of standard:
configurationTemplate: holds template url
configurationTitleTemplate: holds configuration of the title template
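
A rough sketch of how the upload module's put(metadata, configuration) might be wired together. Downloading, duplicate checking and uploading are passed in as plain callables so the sketch runs without network or wiki access; none of them are pywikipediabot functions. Treating the title template as a %-style format string is also an assumption, as are the names make_put and fake_is_duplicate.

```python
def make_put(fetch, is_duplicate, upload):
    """Build put(metadata, configuration) from three caller-supplied callables.

    fetch(url) returns bytes, is_duplicate(data) returns a bool, and
    upload(title, data, metadata) stores the file. All three are
    placeholders for this sketch, not pywikipediabot functions.
    """
    def put(metadata, configuration):
        url = metadata[configuration["sourceKey"]]
        data = fetch(url)
        if is_duplicate(data):
            return False  # skipped, nothing uploaded
        # Assumption: the title template is a %-style format string.
        title = configuration["configurationTitleTemplate"] % metadata
        upload(title, data, metadata)
        return True
    return put


# Stand-ins so the sketch runs on its own.
seen, uploaded = set(), []


def fake_is_duplicate(data):
    if data in seen:
        return True
    seen.add(data)
    return False


put = make_put(
    fetch=lambda url: url.encode("utf-8"),
    is_duplicate=fake_is_duplicate,
    upload=lambda title, data, metadata: uploaded.append(title),
)
configuration = {
    "configurationTemplate": "https://example.org/wiki/Template:Example_ingestion",
    "configurationTitleTemplate": "%(title)s (%(id)s).jpg",
    "sourceKey": "image_url",
}
item = {"title": "Example", "id": "42", "image_url": "https://example.org/42.jpg"}
print(put(item, configuration))  # first call uploads
print(put(item, configuration))  # same bytes again, so skipped
print(uploaded)
```

Keeping put() down to the two arguments agreed in the session means the conversion module only has to produce the two dicts.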
 
 
=== Saturday ===
 
discussion about how the old upload scripts were hacked together
 
We were talking yesterday about putting config also online, on the wiki
you start the bot and ... give it a line, where the config file lives on Commons, and the config file includes URLs
so we should make a template on Commons that ... contains config fields
sample config files?
not configurable currently, so Maarten D is coming up with what the required fields are
 
MaZ:  so if you have a bunch of data, it doesn't matter whether it is XML or  CSV or what. convert it into a dict, key-value configurations
 
MaD: single run, loops over all the items, for each item you have a dict.  A long list of dictionaries holding key-values
 
name of the keys?
 
Why can't people just make their own template?  .... we could make it easier
hardest thing is translation from external format to a dictionary of dictionaries of items
let's do that first?
 
Maarten has the lib he's building.
 
Jarek volunteers for template work.  Maybe attack templates for creators.
 
discussion of XML vs JSON
 
We have to assume they will have arbitrary key values
 
If we're going to convert their field names into our own dict, why not just go ahead & build template with those values, so as not to do a 2nd conversion.  Where do we config? in the template?...
 
[misunderstanding -- where is the config, what redundant work would be]
"show me an example you want to do"
 
if their title is zygote.... put in the parameter as zygote.
Do we need to make a program for that? most commoners that work with GLAMs know templates, so you don't have to make a tool that helps for that, because they already know it
So why do we need to create a dict?
That lib needs to know what matches what? no, just throws it to the template.  Know what to send where?
No... just throws entire item dict with... zygote.... doesn't matter, throws entire item to the template ... template handles it via regex or something
 
so, make the social assumption that people will put things in the right categories, instead of doing the technical work to get rid of that assumption
 
Kaldari was gonna write a PHP script to create the templates
"in my data, title = zygote".... 
MaZ: if you want to do that, sure
....
 
Resolution: someone needs to write dictionary outputter, won't be Kaldari or MaZ
 
converter from XML to Dict? we already have that
 
CSV is built-in in Python
 
there's a prob with their data, all their URLs refer to webpages instead of images
 
another custom script to scrape images from page
 
no, images are separate
 
Kaldari's task: write script to get URIs for a bunch of images
upload all your local images to toolserver, give you toolserver local URIs
 
Yay, yes, it is desired to make it easy for people to get the images onto toolserver
IT IS CONFIRMED, They set up us the toolserver
 
Leonard Richardson is here, improving Beautiful Soup so it can use Python 3
 
question re config templates -- how to call them? what name?
    parse URI?
    templates -- I just call "data ingestion template" or something like that.....
    isn't there gonna be a diff one?
    this is the one for configuration things
 
Dispenser says he wrote a wrapper that makes the output of Python's difflib -- its HTML diff -- look like MediaWiki's diff
    <Dispenser> Here's an example of it in action:
 
 
how do we know that the database name is correct?
 
 
[quiet hacking at 12:15pm localtime]
 
12:40pm local - going to lunch, back in 90 minutes
 
2:13pm local - back from lunch, but discussing Wiki Loves Monuments stuff - http://etherpad.wikimedia.org/GLAMcampNYCsat
 
commiserating over encoding